CS8091_BDA_Unit_I_Analytical_Architecture
1. CS8091 / Big Data Analytics
III Year / VI Semester
2. Objectives
• To study basic Big Data and analytics concepts, HDFS, and MapReduce.
• To learn the fundamentals of Clustering and Classification.
• To understand the fundamental concepts of Association and the different types of Recommendation Systems.
• To learn about stream computing and study various case studies.
• To gain an introductory knowledge of NoSQL Data Management and Visualization.
3. Unit I - INTRODUCTION TO BIG DATA
Evolution of Big Data - Best Practices for Big Data Analytics - Big Data Characteristics - Validating - The Promotion of the Value of Big Data - Big Data Use Cases - Characteristics of Big Data Applications - Perception and Quantification of Value - Understanding Big Data Storage - A General Overview of High-Performance Architecture - HDFS - MapReduce and YARN - MapReduce Programming Model.
4. DATA
The quantities, characters, or symbols on which
operations are performed by a computer, which may be stored
and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.
5. BIG DATA
Big Data is a term for a collection of data that is already huge in size and yet keeps growing exponentially with time.
"Big Data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
6. BIG DATA
Units of Memory (smallest to largest):
• Byte
• Kilobyte
• Megabyte
• Gigabyte
• Terabyte
• Petabyte
• Exabyte
• Zettabyte
• Yottabyte
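The ladder of units above can be sketched in code. This is a minimal illustration, not part of the original slides: the function name `to_unit` is hypothetical, and the step factor of 1024 between units is an assumption, since the slide does not state one (pass `base=1000` for the decimal SI convention).

```python
# Unit names in ascending order, as listed on the slide.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def to_unit(num_bytes, base=1024):
    """Scale a byte count up the unit ladder until the value drops below `base`.

    Assumption: each unit is `base` times the previous one (1024 by default).
    Returns a (value, unit_name) pair.
    """
    value = float(num_bytes)
    index = 0
    # Stop at the last unit even if the value is still large.
    while value >= base and index < len(UNITS) - 1:
        value /= base
        index += 1
    return value, UNITS[index]


if __name__ == "__main__":
    for n in (512, 1024, 5 * 1024**4, 1024**8):
        value, unit = to_unit(n)
        print(f"{n} bytes = {value:g} {unit}")
```

Each step up the list multiplies the capacity by the chosen base, so a yottabyte is eight such steps above a byte.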
8. BIG DATA - Sources
• Primary sources of Big Data
  • Social data:
    • Likes,
    • Tweets and Retweets,
    • Comments,
    • Video uploads and general media
9. BIG DATA - Sources
• Primary sources of Big Data
  • Machine data:
    • Industrial equipment,
    • Sensors installed in machinery,
    • Web logs that track user behavior,
    • Sensors such as medical devices, smart meters, road cameras, satellites, and games
10. BIG DATA - Sources
• Primary sources of Big Data
  • Transactional data:
    • Invoices,
    • Payment orders,
    • Storage records,
    • Delivery receipts
12. BIG DATA - Data Structures
• Structured data:
  • Data containing a defined data type, format, and structure.
13. BIG DATA - Data Structures
• Semi-structured data:
  • Information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
  • Example: XML data
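To make the XML example concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The document, its tag names, and the helper `summarize_orders` are hypothetical and only illustrate the idea: the tags give the data enough organization to be queried, even though not every record has the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; tag and attribute names are made up for illustration.
XML_TEXT = """
<orders>
  <order id="1001">
    <customer>Asha</customer>
    <item sku="A-17" qty="2"/>
    <item sku="B-03" qty="1"/>
  </order>
  <order id="1002">
    <customer>Ravi</customer>
    <item sku="A-17" qty="5"/>
    <note>Gift wrap</note>
  </order>
</orders>
"""


def summarize_orders(xml_text):
    """Return one dict per <order>: id, customer, total quantity, optional note."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.findall("order"):
        rows.append({
            "id": order.get("id"),
            "customer": order.findtext("customer"),
            # Orders may contain any number of <item> elements.
            "total_qty": sum(int(item.get("qty")) for item in order.findall("item")),
            # findtext returns None when the element is absent.
            "note": order.findtext("note"),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_orders(XML_TEXT):
        print(row)
```

The second order carries a `<note>` that the first one lacks, which is the kind of irregularity a fixed relational schema would not accept without a change.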
14. BIG DATA - Data Structures
• Quasi-structured data:
  • Textual data with erratic data formats that can be formatted with effort, software tools, and time. An example of quasi-structured data is the data about which webpages a user visited and in what order.
15. BIG DATA - Data Structures
• Quasi-structured data:
16. BIG DATA - Data Structures
• Unstructured data:
  • Data that has no inherent structure, which may include text documents, PDFs, images, and video.
17. BIG DATA - Data Structures
• A clickstream can be parsed and mined by data scientists to discover usage patterns and uncover relationships among clicks and areas of interest on a website or group of sites.
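As a rough illustration of parsing and mining a clickstream, the sketch below reduces each visited URL to its page path and counts page-to-page transitions. The URLs and the helper names `page_paths` and `transition_counts` are hypothetical, and real clickstream logs would need considerably more cleaning.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical clickstream for one user session, in visit order.
CLICKS = [
    "http://www.example.com/",
    "https://www.example.com/products?ref=home",
    "https://www.example.com/products/42",
    "https://www.example.com/products?page=2",
    "https://www.example.com/products/42",
    "https://www.example.com/cart",
]


def page_paths(urls):
    """Strip scheme, host, and query string, keeping only the page path."""
    return [urlparse(url).path or "/" for url in urls]


def transition_counts(urls):
    """Count how often the user moved from one page directly to another."""
    paths = page_paths(urls)
    return Counter(zip(paths, paths[1:]))


if __name__ == "__main__":
    for (source, target), count in transition_counts(CLICKS).most_common():
        print(f"{source} -> {target}: {count}")
```

Counting consecutive pairs is only one simple way to expose relationships among clicks; it ignores timing and session boundaries.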
18. Types of Data Repositories, from an Analyst Perspective
Data Repository: Spreadsheets and data marts
Characteristics:
• Spreadsheets and low-volume databases for recordkeeping
• Analyst depends on data extracts
19. Types of Data Repositories, from an Analyst Perspective
Data Repository: Data Warehouses
Characteristics:
• Centralized data containers in a purpose-built space
• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT and DBAs for data access and schema changes
• Analysts must spend significant time to get aggregated and disaggregated data extracts from multiple sources
20. Types of Data Repositories, from an Analyst Perspective
Data Repository: Analytic Sandbox (workspaces)
Characteristics:
• Data assets gathered from multiple sources and technologies for analysis
• Enables flexible, high-performance analysis in a nonproduction environment; can leverage in-database processing
• Reduces costs and risks associated with data replication into "shadow" file systems
• "Analyst owned" rather than "DBA owned"
21. State of the Practice in Analytics
Business Driver: Examples
• Optimize business operations: Sales, pricing, profitability, efficiency
• Identify business risk: Customer churn, fraud, default
• Predict new business opportunities: Upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)
23. BI Versus Data Science
• BI systems make it easy to answer questions about:
  • Quarter-to-date revenue,
  • Progress toward quarterly targets, and
  • How much of a given product was sold in a prior quarter or year
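The BI questions above are aggregations over well-structured records. A minimal sketch, assuming a hypothetical list of dated sales amounts and a made-up quarterly target:

```python
from datetime import date

# Hypothetical transactions: (sale date, amount).
SALES = [
    (date(2024, 1, 15), 1200.0),
    (date(2024, 2, 3), 800.0),
    (date(2024, 3, 28), 500.0),
    (date(2024, 4, 2), 950.0),
    (date(2024, 5, 20), 1050.0),
]


def quarter_of(day):
    """Calendar quarter (1-4) of a date."""
    return (day.month - 1) // 3 + 1


def quarter_to_date_revenue(sales, as_of):
    """Sum amounts from the start of as_of's quarter up to and including as_of."""
    quarter = quarter_of(as_of)
    return sum(
        amount
        for day, amount in sales
        if day.year == as_of.year and quarter_of(day) == quarter and day <= as_of
    )


def progress_toward_target(sales, as_of, target):
    """Fraction of the quarterly target reached so far."""
    return quarter_to_date_revenue(sales, as_of) / target


if __name__ == "__main__":
    today = date(2024, 5, 31)
    print(quarter_to_date_revenue(SALES, today))
    print(progress_toward_target(SALES, today, target=4000.0))
```

Both answers look backward at what has already happened, which is the typical emphasis of BI reporting.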
24. BI Versus Data Science
• Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analyzing the present and enabling informed decisions about the future.
25. BI Versus Data Science
• BI problems tend to require highly structured data organized in rows and columns for accurate reporting.
• Data Science projects tend to use many types of data sources, including large or unconventional datasets.
26. Current Analytical Architecture
• Most organizations still have data warehouses that provide excellent support for traditional reporting and simple data analysis activities but have a more difficult time supporting more robust analyses.
28. Current Analytical Architecture
• For data sources to be loaded into the data warehouse, the data needs to be well understood, structured, and normalized with the appropriate data type definitions.
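The requirement that data match defined types before loading can be sketched as a simple validation gate. This is an illustration under stated assumptions, not how any particular warehouse product works: the schema, its column names, and the helpers `validate_row` and `load` are hypothetical.

```python
from datetime import date

# Hypothetical target schema: column name -> required Python type.
SCHEMA = {"customer_id": int, "order_date": date, "amount": float}


def validate_row(row, schema=SCHEMA):
    """Return a list of problems; an empty list means the row may be loaded."""
    problems = []
    for column, expected in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


def load(rows, schema=SCHEMA):
    """Split rows into those that pass validation and those that are rejected."""
    accepted, rejected = [], []
    for row in rows:
        (rejected if validate_row(row, schema) else accepted).append(row)
    return accepted, rejected


if __name__ == "__main__":
    rows = [
        {"customer_id": 7, "order_date": date(2024, 1, 5), "amount": 19.5},
        {"customer_id": "7", "order_date": date(2024, 1, 5), "amount": 19.5},
    ]
    accepted, rejected = load(rows)
    print(len(accepted), "accepted;", len(rejected), "rejected")
```

A row with an unexpected column is rejected rather than stored, which mirrors why new kinds of data are hard to bring into a tightly controlled warehouse.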
29. Current Analytical Architecture
• Although this kind of centralization enables security, backup, and failover of highly critical data, it also means that data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment.
30. Current Analytical Architecture
• As a result of this level of control on the enterprise data warehouse (EDW), additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis.
31. Current Analytical Architecture
• Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.
• These are high-priority operational processes getting critical data feeds from the data warehouses and repositories.
32. Current Analytical Architecture
• Analysts create data extracts from the EDW to analyze data offline in R or other local analytical tools.
33. Current Analytical Architecture
• Because of the rigorous validation and data structuring process, new data sources are slow to move into the EDW, and the data schema is slow to change.
34. Current Analytical Architecture
• Departmental data warehouses may have been originally designed for a specific purpose and set of business needs, but over time evolved to house more and more data, some of which may be forced into existing schemas to enable BI and the creation of OLAP cubes for analysis and reporting.
36. Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Medical information, such as genomic sequencing and diagnostic imaging
  • Photos and video footage uploaded to the World Wide Web
37. Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Video surveillance, such as the thousands of video cameras spread across a city
  • Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smartphones
38. Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures
  • Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing
40. Emerging Big Data Ecosystem and a New Approach to Analytics
• Data devices
  • Data devices and the "Sensornet" gather data from multiple locations and continuously generate new data about this data.
  • Example: a video game provider captures data about the skill and levels attained by the player.
41. Emerging Big Data Ecosystem and a New Approach to Analytics
• Data devices
  • As a consequence, the game provider can fine-tune the difficulty of the game, suggest other related games that would most likely interest the user, and offer additional equipment and enhancements for the character based on the user's age, gender, and interests.
42. Emerging Big Data Ecosystem and a New Approach to Analytics
• Data collectors
  • Example: retail stores track the path a customer takes through the store while pushing a shopping cart with an RFID chip, so they can gauge which products get the most foot traffic using geospatial data collected from the RFID chips.
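The retail example can be sketched as follows. The RFID reads, zone names, and the helpers `cart_paths` and `foot_traffic` are hypothetical; an actual deployment would record coordinates and timestamps rather than named zones.

```python
from collections import defaultdict

# Hypothetical RFID reads: (cart id, store zone), in the order they were recorded.
READS = [
    ("cart-1", "produce"), ("cart-1", "dairy"), ("cart-1", "checkout"),
    ("cart-2", "produce"), ("cart-2", "produce"), ("cart-2", "bakery"),
    ("cart-2", "checkout"), ("cart-3", "dairy"), ("cart-3", "checkout"),
]


def cart_paths(reads):
    """Path taken by each cart, collapsing consecutive reads of the same zone."""
    paths = defaultdict(list)
    for cart, zone in reads:
        if not paths[cart] or paths[cart][-1] != zone:
            paths[cart].append(zone)
    return dict(paths)


def foot_traffic(reads):
    """Number of distinct carts seen in each zone."""
    carts_by_zone = defaultdict(set)
    for cart, zone in reads:
        carts_by_zone[zone].add(cart)
    return {zone: len(carts) for zone, carts in carts_by_zone.items()}


if __name__ == "__main__":
    print(cart_paths(READS))
    print(foot_traffic(READS))
```

Counting distinct carts per zone, rather than raw reads, keeps a cart that lingers in one aisle from inflating that aisle's traffic.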
43. Emerging Big Data Ecosystem and a New Approach to Analytics
• Data aggregators
  • Organizations compile data from the devices and usage patterns collected by government agencies, retail stores, and websites.
  • In turn, they can choose to transform and package the data as products to sell to list brokers, who may want to generate marketing lists of people who may be good targets for specific ad campaigns.
44. Emerging Big Data Ecosystem and a New Approach to Analytics
• Data users and buyers
  • These groups directly benefit from the data collected and aggregated by others within the data value chain.