2. Agenda
• What is Big Data?
• Concepts and Terminology
• Big Data Characteristics
• Different Types of Data
• Case Study Background
• Marketplace Dynamics
• Business Architecture
• Information and Communications
Technology
29. M2M Data
• Data generated by different sources
around us like automated systems, sensors
and mobile devices.
• 2.5 quintillion bytes of data are created
every day.
• 80-90% of the data in the world today has
been created in the last two years alone.
30. Flood of Data
• More than 4.5 billion internet users in the
world today.
• The New York Stock Exchange generates
about 4-5 TB of data per day.
• 7TB of data are processed by Twitter every
day.
• 10TB of data are processed by Facebook
every day, growing at 7 PB per month.
33. Flood of Data (cont’d)
• Interestingly, about 80% of this data is
unstructured.
• With this massive quantity of data,
businesses need fast, reliable, deeper data
insight.
• Therefore, Big Data solutions based on
Hadoop and other analytics software are
becoming more and more relevant.
35. Handling Humongous Data
• Traditional approaches are not fit for data
analysis because of the inflation in data volume.
• Handling large volumes of data, whether
structured or unstructured.
• Datasets that grow so large that it is
difficult to capture, store, manage, share,
analyze and visualize with the typical
database software tools.
37. Big Data Analytic Applications
• Analyze the market and derive new strategies
to improve business in different geographic
locations.
• Learn the response to campaigns,
promotions, and other advertising
media.
• Use the medical history of patients so that
hospitals can provide better and quicker service.
38. Big Data Analytic Applications
• Perform Risk Analysis.
• Create new revenue streams.
• Reduce maintenance costs.
• Faster, better decision making.
• New products & services.
• Etc
39. Data Science as Tool
• Involves using methods to analyze massive
amounts of data and extract the
knowledge it contains.
• Data science and big data evolved from
statistics and traditional data management
but are now considered to be distinct
disciplines.
41. Data Science Processes
1. Setting the research goal
2. Retrieving data
3. Cleansing, integrating, and transforming
data
4. Exploratory data analysis
5. Building model(s)
6. Presenting findings (insights)
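The six steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the sample records, the field names `region` and `sales`, and the per-region mean used as a "model" are all made up for the example.

```python
from statistics import mean

# 1. Research goal (assumed): which region has the highest average sales?

# 2. Retrieve data: hypothetical raw records, as they might arrive from a file.
raw_records = [
    {"region": "North", "sales": "120"},
    {"region": "north", "sales": "80"},
    {"region": "South", "sales": ""},      # missing value
    {"region": "South", "sales": "150"},
]

# 3. Cleanse, integrate, transform: drop incomplete rows, normalise names and types.
clean = [
    {"region": r["region"].title(), "sales": float(r["sales"])}
    for r in raw_records
    if r["sales"]
]

# 4. Exploratory analysis: group sales by region.
by_region = {}
for row in clean:
    by_region.setdefault(row["region"], []).append(row["sales"])

# 5. Build a (deliberately trivial) model: the mean per region.
model = {region: mean(values) for region, values in by_region.items()}

# 6. Present the finding.
best_region = max(model, key=model.get)
print(f"Highest average sales: {best_region} ({model[best_region]:.1f})")
```

A real project would replace step 5 with a proper statistical or machine learning model; the point here is only the order of the steps.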
43. Datasets
• Collections or groups of related data are
generally referred to as datasets.
• Each group or dataset member (datum)
shares the same set of attributes or
properties as others in the same dataset.
44. Example of Datasets
• Tweets stored in a flat file
• A collection of image files in a directory
• An extract of rows from a database table
stored in a CSV formatted file
• Historical weather observations that are
stored as XML files
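As a small sketch of the CSV case, the snippet below loads a dataset and checks that every datum shares the same attributes. The column names and values are hypothetical, and the text is read from a string rather than a real file so the example stays self-contained.

```python
import csv
import io

# Hypothetical CSV extract; in practice this would be a file on disk.
csv_text = """city,date,temp_c
Bangkok,2021-01-01,31.5
Chiang Mai,2021-01-01,27.0
Phuket,2021-01-01,30.2
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every datum (row) exposes the same attributes (columns).
attribute_sets = {tuple(row.keys()) for row in rows}
print(len(rows), "data items sharing attributes", attribute_sets)
```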
45. Data Analysis
• Process of examining data to
find facts, relationships,
patterns, insights and/or
trends.
• Goal: to support better
decision making
• Help establish patterns and
relationships among the
data being analyzed
46. Data Analytics
• Discipline that includes the management
of the complete data lifecycle, which
encompasses collecting, cleansing,
organizing, storing, analyzing and
governing data.
• Involves the development of analysis
methods, scientific techniques and
automated tools.
47. Data Analytics
• Has developed methods that allow data
analysis to occur through the use of highly
scalable distributed technologies and
frameworks that are capable of analyzing
large volumes of data from different
sources.
• Enable data-driven decision-making with
scientific backing so that decisions can be
based on factual data and not simply on
past experience or intuition alone.
48. Data Analytics Categories
There are four general categories of analytics
that are distinguished by the results they
produce:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
49. Descriptive Analytics
• Carried out to answer questions about events
that have already occurred.
• Contextualizes data to generate information.
• Often carried out via ad-hoc reporting or
dashboards.
• The reports are generally
static in nature and
display historical data that
is presented in the form
of data grids or charts.
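A descriptive question such as "what was the sales volume per month?" reduces to summarising historical records. The sketch below assumes a hypothetical list of `(month, region, amount)` tuples and prints a simple static grid.

```python
from collections import defaultdict

# Hypothetical historical sales records: (month, region, amount).
sales = [
    ("2021-01", "East", 100.0),
    ("2021-01", "West", 150.0),
    ("2021-02", "East", 120.0),
    ("2021-02", "West", 90.0),
]

# Descriptive question: what was the sales volume per month?
monthly_total = defaultdict(float)
for month, _region, amount in sales:
    monthly_total[month] += amount

# A static, grid-style report of historical data.
for month in sorted(monthly_total):
    print(f"{month}  {monthly_total[month]:>8.1f}")
```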
50. Diagnostic Analytics
• Determine the cause of a phenomenon that
occurred in the past using questions that focus
on the reason behind the event.
• Requires collecting data from multiple sources
and storing it in a structure that lends itself
to performing drill-down and roll-up analysis.
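Roll-up and drill-down can be sketched on a two-level hierarchy. The region and city names and the call counts below are hypothetical; `roll_up` and `drill_down` are illustrative helper names, not functions from any particular BI tool.

```python
from collections import defaultdict

# Hypothetical support-call counts: (region, city, calls).
calls = [
    ("East", "Boston", 40),
    ("East", "New York", 70),
    ("West", "Seattle", 30),
    ("West", "Portland", 20),
]

def roll_up(records):
    """Aggregate city-level counts up to the region level."""
    totals = defaultdict(int)
    for region, _city, count in records:
        totals[region] += count
    return dict(totals)

def drill_down(records, region):
    """Return the city-level detail behind one region's total."""
    return {city: count for r, city, count in records if r == region}

by_region = roll_up(calls)
east_detail = drill_down(calls, "East")
print(by_region, east_detail)
```

Seeing that East has more calls than West (roll-up) prompts the "why" question; the drill-down shows which cities drive the difference.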
51. Predictive Analytics
• Carried out in an attempt to determine the
outcome of an event that might occur in the
future.
• The models used for predictive analytics have
implicit dependencies on the conditions under
which the past events occurred.
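A very simple predictive estimate is a conditional frequency taken from past events. The sketch below assumes a hypothetical loan history of `(missed_a_payment, defaulted)` pairs; real predictive models are usually far richer than this.

```python
# Hypothetical loan history: (missed_a_payment, defaulted).
history = [
    (True, True), (True, False), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]

def default_rate(records, missed):
    """Estimate P(default | missed payment == missed) from past records."""
    outcomes = [defaulted for m, defaulted in records if m == missed]
    return sum(outcomes) / len(outcomes)

p_default_if_missed = default_rate(history, True)
p_default_if_not = default_rate(history, False)
print(p_default_if_missed, p_default_if_not)
```

The estimate is only as good as the history behind it: if the conditions under which these loans were issued change, the rates no longer apply, which is the implicit dependency noted above.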
52. Prescriptive Analytics
• Build upon the results
of predictive analytics
by prescribing actions
that should be taken.
• The focus is not only on
which prescribed
option is best to follow,
but why.
• Can be used to gain an
advantage or mitigate a
risk.
54. Business Intelligence (BI)
• Enables an organization to gain insight into
the performance of an enterprise using
analyzed data.
• The analyzed data is generated by its business
processes and information systems.
• The results of the analysis can be used by
management to steer the business in an
effort to correct detected issues or otherwise
enhance organizational performance.
55. Business Intelligence (BI)
• BI applies analytics to large amounts of
data across the enterprise, which has
typically been consolidated into an
enterprise data warehouse to run
analytical queries.
56. Business Intelligence (BI)
• The output of BI can be surfaced to a
dashboard.
• The dashboard allows managers to access
and analyze the results
• and to potentially refine the analytic
queries to further explore the data.
57. Key Performance Indicators
(KPIs)
• A metric that can be used to gauge
success within a particular business
context.
• Linked with an enterprise’s overall
strategic goals and objectives.
• Often used to identify business
performance problems and demonstrate
regulatory compliance.
58. Key Performance Indicators
(KPIs)
• Act as quantifiable reference points for
measuring a specific aspect of a business’
overall performance.
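One way to treat a KPI as a quantifiable reference point is to compare each measurement with an acceptable range. The KPI names, values and ranges below are hypothetical, and `kpi_status` is an illustrative helper.

```python
# Hypothetical KPI readings with the acceptable range for each.
kpis = {
    "on_time_delivery_pct": {"actual": 93.0, "low": 95.0, "high": 100.0},
    "support_calls_per_day": {"actual": 40.0, "low": 0.0, "high": 50.0},
}

def kpi_status(actual, low, high):
    """Return 'ok' when the measurement lies inside the acceptable range."""
    return "ok" if low <= actual <= high else "out of range"

status = {name: kpi_status(**reading) for name, reading in kpis.items()}
print(status)
```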
60. Big Data Definition
• For some, it is a buzzword that tries
to capture the “new” need to
process a lot of data.
• The “Three Vs” are usually used to define Big
Data.
61. Volume
• The anticipated volume of data that is
processed by Big Data solutions is substantial
and ever-growing.
• High data volumes
impose distinct data
storage and processing
demands, as well as
additional data
preparation, curation
and management
processes.
62. Velocity
• In Big Data environments, data
can arrive at fast speeds.
• Enormous datasets can
accumulate within very short
periods of time.
• Coping with the fast inflow of
data requires the enterprise to
design highly elastic and
available data processing
solutions and corresponding
data storage capabilities.
63. Variety
• The multiple formats and types of data that
need to be supported by Big Data solutions.
• Data variety brings challenges for enterprises
in terms of data integration, transformation,
processing, and storage.
66. Veracity
• Veracity refers to the quality or fidelity of
data.
• Data that enters Big Data environments
needs to be assessed for quality, which can
lead to data processing activities to resolve
invalid data and remove noise.
• Noise is data that cannot be converted
into information and thus has no value,
whereas signals have value and lead to
meaningful information.
70. Value
• Value is defined as the usefulness of data for
an enterprise.
• Value is also dependent on how long data
processing takes.
• The longer it takes for data to be turned into
meaningful information, the less value it has
for a business.
73. Structured Data
• Conforms to a data model or schema and
is often stored in tabular form.
• Used to capture relationships between
different entities and is therefore most
often stored in a relational database.
• Frequently generated by enterprise
applications and information systems like
ERP and CRM systems.
• Rarely requires special consideration
with regard to processing or storage.
74. Unstructured Data
• Data that does not conform to a data model
or data schema is known as unstructured
data.
• It is estimated that unstructured data makes
up 80% of the data within any given
enterprise.
• Unstructured data has a faster growth rate
than structured data.
• This form of data is either textual or binary
and often conveyed via files that are
self-contained and non-relational.
75. Semi-structured Data
• Has a defined level of structure and
consistency that is not relational in nature but
is hierarchical or graph-based.
• This kind of data is commonly stored in files
that contain text.
• Because it is textual and conforms to some level
of structure, it is more easily processed than
unstructured data.
• Often has special pre-processing and storage
requirements, especially if the underlying
format is not text-based.
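JSON is one common text-based, hierarchical format of this kind. The sketch below parses a hypothetical sensor document (all keys and values are invented for illustration) and shows that records in the same file need not carry identical fields.

```python
import json

# Hypothetical sensor document in JSON, a text-based semi-structured format.
document = """
{
  "sensor": {"id": "s-17", "location": {"lat": 13.75, "lon": 100.5}},
  "readings": [
    {"t": "2021-01-01T00:00", "pm25": 42},
    {"t": "2021-01-01T01:00", "pm25": 55, "note": "recalibrated"}
  ]
}
"""

record = json.loads(document)

# Hierarchical access: nested keys rather than relational joins.
latitude = record["sensor"]["location"]["lat"]

# Records need not share identical fields, so optional keys are read defensively.
notes = [r.get("note", "") for r in record["readings"]]
print(latitude, notes)
```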
76. Unstructured Data
• Machine Generated
  • Satellite images
  • Scientific data
  • Photographs and video
  • Radar or sonar data
• Human Generated
  • Word, PDF, Text
  • Social media data (Facebook, Twitter, LinkedIn)
  • Mobile data (text messages)
  • Website content (blogs, Instagram)
77. Metadata
• Provides information about a dataset’s
characteristics and structure.
• Mostly machine-generated and can be
appended to data.
• The tracking of metadata is crucial to Big
Data processing, storage and analysis – it
provides information about the pedigree
of the data and its provenance during
processing.
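File-system attributes are a familiar case of machine-generated metadata. The sketch below writes a throwaway file and reads back its size and modification time; the dictionary keys `size_bytes` and `modified_utc` are names chosen for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello big data")
    path = handle.name

info = os.stat(path)

# Machine-generated metadata describing the file, not its content.
metadata = {
    "size_bytes": info.st_size,
    "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
}
print(metadata)

os.remove(path)
```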
80. Overview
• Businesses entrenched and worked to
improve their efficiency and effectiveness
to stabilize their profitability by reducing
costs.
• Companies began to focus outward,
looking to find new customers and keep
existing customers from defecting.
• They offer new products and services and
deliver increased value propositions to
customers.
81. External Data
• Companies need to expand their Business
Intelligence activities beyond retrospection
on extracted internal information.
• They need to open themselves to external data
sources as a means of sensing the marketplace
and their position within it.
• External data can bring additional context
to their internal data.
• This allows a corporation to move up the analytic
value chain from hindsight to insight and
foresight.
82. DIKW Pyramid
• Shows how:
• data can be enriched with context to create information
• information can be supplied with meaning to create knowledge
• knowledge can be integrated to form wisdom.
84. Overview
• Business architecture (BA) provides a means of blueprinting or
concretely expressing the design of the
business.
• It helps an organization align its strategic
vision with its underlying execution.
• It includes linkages from abstract concepts
to more concrete ones.
• Linkages provide guidance as to how to
align the business and its information
technology.
85. Business as Layered System
• Top layer: strategic layer occupied by
C-level executives and advisory groups
• Middle layer: tactical or managerial layer
that seeks to steer the organization in
alignment with the strategy
• Bottom layer: operations layer where a
business executes its core processes and
delivers value to its customers.
86. Business as Layered System
• Each layer’s goals and objectives are
influenced by and often defined by the
layer above.
• Communication flows bottom-up via the
collection of metrics.
• Activity monitoring at the operations layer
generates Performance Indicators (PIs) and
metrics.
87. Business as Layered System
• PIs and metrics are aggregated to create Key
Performance Indicators (KPIs) used at the
tactical layer.
• KPIs can be aligned with Critical Success
Factors (CSFs) at the strategic layer.
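The flow from operational PIs to a tactical KPI and on to a strategic CSF can be sketched as a simple aggregation. The team names, handling times and the 15-minute target are assumptions made for this illustration.

```python
from statistics import mean

# Hypothetical PIs from the operations layer: order handling time in minutes, per team.
pis = {
    "team_a": [12.0, 15.0, 9.0],
    "team_b": [20.0, 18.0],
}

# Tactical-layer KPI: average handling time across all teams.
all_times = [t for times in pis.values() for t in times]
kpi_avg_handling = mean(all_times)

# Assumed strategic target tied to a CSF such as "fast order fulfilment".
csf_target_minutes = 15.0
meets_csf = kpi_avg_handling <= csf_target_minutes
print(kpi_avg_handling, meets_csf)
```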
88. Big Data & Business Layers
• Big Data has ties to business architecture at each of
the organizational layers.
• It helps convert data into information (what) and
provides meaning to generate knowledge (how) from
information.
• The information can be examined to answer
questions regarding how the business is performing.
• With such knowledge, the strategic layer can
provide insight (why) into which strategy
needs to be adopted in order to enhance
performance.
89. DIKW Pyramid & Business
Layers
Modified DIKW pyramid that aligns
with Strategic, Tactical and
Operational corporate levels
90. Feedback Loop
• The strategic layer drives response via the
application of judgment by making
decisions that are communicated as
constraints to the tactical layer.
• The tactical layer leverages this knowledge
to generate priorities and actions that
conform to corporate direction.
• These actions adjust the execution of
business at the operational layer.
91. Feedback Loop
• The change in the experience of internal
stakeholders and external customers as they
deliver and consume business services should
be measurable.
• This change (result) should surface and be
visible in the data in the form of changed PIs
that are then aggregated into KPIs.
• Over time, the strategic and management
layers’ injection of judgment and action into
the loop will serve to refine the delivery of
business services.
92. The “Anatomy of Knowledge”
An organization can relate and align its organizational layers
by creating a virtuous cycle via a feedback loop.
94. Data Analytics & Data Science
• Used to find new insights that can drive more
efficient and effective operations and give
management the ability to steer the
business proactively.
• Allow the C-suite to better formulate and
assess their strategic initiatives.
• Businesses are looking for new ways to gain a
competitive edge.
95. Digitization
• The use of digital artifacts saves both time
and cost.
• As consumers connect to a business
through their interaction with these digital
substitutes, it leads to an opportunity to
collect further “secondary” data.
96. Digitization
• Collecting secondary data can be
important for businesses for:
• customized marketing
• automated recommendations
• development of optimized product
features.
97. Affordable Technology
• Technology capable of
storing and processing
large quantities of
diverse data has become
increasingly affordable.
• Big Data solutions often
leverage open-source
software that executes
on commodity hardware.
98. Social Media
• Has empowered customers to provide feedback
in near-real-time via open and public mediums.
• Businesses are storing increasing amounts of
data from social media sites.
• This information feeds Big Data analysis
algorithms that help businesses:
• provide better levels of service
• increase sales
• enable targeted marketing
• create new products and services.
99. Hyper-connected communities
and devices
• The internet and the proliferation of cellular
and Wi-Fi networks have enabled more people
and their devices to be continuously connected.
• The proliferation of internet-connected sensors,
as in the Internet of Things (IoT), increases the
number of available data streams.
100. Cloud Computing
• Allows the creation
of environments that
are capable of
providing highly
scalable, on-demand
IT resources.
Editor's Notes
How to learn
Emphasis on programming with Java
Apology for document
Document is not quite complete
Some parts are irrelevant
Some just get added because of its interesting nature
Some are missing
Some are not part of this document
Student must lecture on undocumented details
Where do data come from? => People
Data creating devices
- computer
- mobile
- other kinds of devices
Other devices
- Smart TV
- Games consoles
- Smart Home / Smart appliance
- smart personal gadget
We spend 3h 39m on smart phone
Search
Text
Video
email
Google and Bing
Social media is one of the biggest sources of data
98.8% accessing from mobile
East + South + SEA > 50% of users
25 – 34 new work force
- see advertisement of
- first car
- first condo
- new investor training
- cosmetics
2H 16m on Social Media
¼ of internet users use SM for work.
-> social media incorporate real business functionality
Line app
Line Payment
English, Spanish
India (Hindi) 3rd
Middle east – Arabic
Indonesian
Top 5 > 80%
1 person has 1.3 mobile or 3 persons own 4 mobile.
¾ has access to internet
¾ actively use SM.
Heating market is streaming
New kinds of devices are coming
- Smart Home device
- Smart watch
- VR
Using Internet ~ 9Hrs => what do they do?
On social ~ 3 Hrs
On TV ~ 3.5 Hrs
Advertising goto mobile => cheaper / targeting
Internet users grows about 2%
97% of internet user is from mobile
Use internet on Mobile ~ 5hrs
SM users growth = 4.7%
99% access from mobile
Interesting non-US products are LINE and TikTok
In 2021 TikTok reaches 54%
Sensor ( traffic cam, satellite) , IoT, Smart Appliances
Quintillion => America (10**18) / English (10**30)
Unstructured data is mostly generated by humans, both social and business
Businesses need insight
Big Data solutions are needed
New solutions are needed to handle massive data
Handling large volumes of data (zettabytes and yottabytes), which are structured or unstructured.
Data Island – each machine keeps its data
Datawarehouse – centralized
- BI on top – dashboard / reporting
- IT & DBAs control and analyze
Analytics sandbox - suite of tools and solutions
- replication (cheap storage)
- business analysts control and analyze data
Gather / collect => group together
Each group shares attributes / properties or features
Log file
Facebook json data
E.g. poll data
- Min / Max / Average of age , salary, education level, etc
The reality is that generating high-value analytic results increases the complexity and cost of the analytic environment.
Answer to What
Example questions:
What was the sales volume over the past 12 months?
What is the number of support calls received as categorized by severity and geographic location?
What is the monthly commission earned by each sales agent?
Queries are executed on operational data stores from within an enterprise,
for example a Customer Relationship Management system (CRM) or Enterprise Resource Planning (ERP) system
Answer to Why
Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the Eastern region than from the Western region?
• Why was there an increase in patient re-admission rates over the past three months?
Roll-up reporting is a feature of Roll-Up Properties, which aggregate data from multiple source properties into a single property. It is a special kind of reporting that lets you analyze the aggregated data that's in a Roll-Up Property.
A drill down report is a report which allows users to navigate to a different layer of data granularity by navigating and clicking a specific data element on a web page or in an application. Drill down allows users to explore multidimensional data by navigating from one level down to a more detailed level.
Questions are usually formulated using a what-if rationale, such as the following:
• What are the chances that a customer will default on a loan if they have missed a monthly payment?
• What will be the patient survival rate if Drug B is administered instead of Drug A?
• If a customer has purchased Products A and B, what are the chances that they will also purchase Product C?
Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
Enables an organization to gain insight into the performance of an enterprise by analyzing data generated by its business processes and information systems.
The results of the analysis can be used by management to steer the business in an effort to correct detected issues or otherwise enhance organizational performance.
>>>> Insight about performance from analyzed data
>>>> analyzed data are collected from business process and IT system
>>>> result of analysis helps management make a decision to solve the root cause of problems.
System of S/W & H/W
>>>> BI needs data from across enterprise
>>>> gathers at centralized data warehouse and run analytics queries.
>>>> graphical output on dashboard
>>>> allows manager to easily interpret
>>> later, can refine query to gain further insight or answer
KPIs are often displayed via a KPI dashboard.
The dashboard consolidates the display of multiple KPIs and compares the actual measurements with threshold values that define the acceptable value range of the KPI.
>>> metric or gauge of business performance
>>>> link with goals and objective of enterprise
>>>> to identify problem
>>>> show how much compliance to regulation
>>>> reference point to measure the overall performance
Right time at the right place
Due to the abundance of tools and databases that natively support structured data, it rarely requires special consideration in regards to processing or storage.
Examples of this type of data include banking transactions, invoices, and customer records.
A text file may contain the contents of various tweets or blog postings.
Binary files are often media files that contain image, audio or video data.
Technically, both text and binary files have a structure defined by the file format itself, but this aspect is disregarded, and the notion of being unstructured is in relation to the format of the data contained in the file itself.
Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.
Examples of common sources of semi-structured data include electronic data interchange (EDI) files, spreadsheets, RSS feeds and sensor data.
An example of pre-processing of semi-structured data would be the validation of an XML file to ensure that it conformed to its schema definition.
Examples of metadata include:
• XML tags providing the author and creation date of a document
• attributes providing the file size and resolution of a digital photograph
Big Data solutions rely on metadata, particularly when processing semi-structured and unstructured data.
Data Classification is the classification of data based on its level of sensitivity and the impact to the Organizational Entity or Personal Entity should that data be subject to Disclosure-Alteration-Destruction (DAD) without authorization.
Data Provenance is Provenance information relevant or pertaining to evaluating the source or author of the data.
Provenance : source , origin , history of ownership
Pedigree
Data lineage includes the data's origins, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.
Data Pedigree refers to the data relationship to an authoritative Entity.
Data Pedigree is an attribute of Data Provenance and could be provided as metadata.
Data Pedigree should be considered during Data Classification
Part of Big Data Adoption factors
>>> businesses constantly improve performance to sustain profit and reduce costs.
>>> high competition, need new customers and keep existing ones.
>>> offer new products and services
>>> or increase the value added of existing products and services
>>> expand BI beyond hindsight
>>>> Understanding the ever-changing market requires external data (market share / penetration / demographics)
>>> add more context (from external data) to internal data
>>> gain more insight and foresight
Y-on-Y sales drop 10%
But global market shrinks by 20%
=> Are we still performing better than the market?
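The question can be checked with a little arithmetic: if own sales and the total market both change, market share changes by the ratio of the two. This sketch uses the figures from the note; the function name `relative_share_change` is made up for the example.

```python
def relative_share_change(own_growth, market_growth):
    """Relative change in market share given own and market growth rates.

    Rates are fractions, e.g. -0.10 for a 10% drop.
    """
    return (1 + own_growth) / (1 + market_growth) - 1

# Own sales drop 10% while the whole market shrinks 20%.
change = relative_share_change(-0.10, -0.20)
print(f"Market share changed by {change:+.1%}")
```

Here 0.90 / 0.80 = 1.125, so market share grew by about 12.5% even though sales fell, which is the kind of context only external data can supply.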
Data -> Information e.g. PM2.5 + location
Information -> Knowledge
http://air4thai.pcd.go.th/webV2/aqi_info.php
0 – 25 very good
26 – 50 good
51 – 100 moderate
101 – 200 starts to affect health
201+ affects health
Knowledge -> Wisdom
PM2.5 > 50 = wear a mask
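The air-quality example maps onto the DIKW steps roughly as follows. The bands and the "above 50" rule come from the notes above; the sample reading, the location, the function names and the "no mask needed" wording are assumptions for illustration.

```python
# Index bands as listed in the notes above: (upper bound, label).
AQI_BANDS = [
    (25, "very good"),
    (50, "good"),
    (100, "moderate"),
    (200, "starts to affect health"),
]

def to_knowledge(aqi):
    """Information -> knowledge: attach meaning to a reading."""
    for upper, label in AQI_BANDS:
        if aqi <= upper:
            return label
    return "affects health"

def to_wisdom(aqi):
    """Knowledge -> wisdom: the rule of thumb from the notes (mask above 50)."""
    return "wear a mask" if aqi > 50 else "no mask needed"

# Data -> information: a raw number plus context (what and where); values are hypothetical.
reading = {"aqi": 72, "location": "Bangkok"}
print(reading["location"], to_knowledge(reading["aqi"]), to_wisdom(reading["aqi"]))
```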
Building a house -> design on blueprint first
Same as business
align its strategic vision with its underlying execution
whether they be technical resources or human capital.
abstract concepts => business mission, vision, strategy and goals
concrete ones => business services, organizational structure, key performance indicators and application services.
>>> Upper layer defines lower layer’s goals and objectives
>>> Lower layer sends metrics (collected data) upward.
>>> Activity monitoring at the operations layer generates
Performance Indicators (PIs)
and metrics,
for both services and processes.
These KPIs can be aligned with Critical Success Factors (CSFs) at the strategic layer,
which in turn help measure progress being made toward the achievement of strategic goals and objectives.
CSFs are the cause of your success,
whereas KPIs are the effects of your actions.
we’re asking “what must we do to be successful?” (CSFs)
and “what indicates that we’re winning?”
With such knowledge, the strategic layer can provide further insight to help answer questions of which strategy needs to change or be adopted in order to correct or enhance the performance.
>>The strategic layer drives response via the application of judgment by
>>>>making decisions regarding corporate strategy, policy, goals and objectives
>>>>that are communicated as constraints to the tactical layer.
>>The tactical layer in turn leverages this knowledge
>>>>to generate priorities and actions
that conform to corporate direction.
>>>>These actions adjust the execution of business
at the operational layer.
Recall that KPIs are metrics that can be associated with critical success factors that inform the executive team as to whether or not their strategies are working.
>>>> measure change in experience of internal stakeholder
and external customers
>>>> result (change) should be visible in the form of collected data as PIs which, in turn, get aggregated into KPIs.
>>>> judgement and action would lead to refining of business services
A diagram produced by Joe Gollner in his blog post “The Anatomy of Knowledge”
>>> Add on-line channel
>>> Collect interaction between user and digitized data as “secondary data”