Overview of BA Discussion
Business Analytics (BA)
Overview
History
Types of Business Analytics
Real world examples
Challenges
Relations to Data Mining
Business Analytics (BA) : an
overview
BA can be considered a subset of Business intelligence
A set of skills, technologies, applications and practices
for the exploration and investigation of past business performance
to gain insight and drive business planning.
Like Business Intelligence, BA can focus either on the
business as a whole or only on segments of it
Focuses on developing new insights and understanding
of performance based on data and statistical methods
BA : Short History
Analytics in business long predates computing
Frederick Taylor, father of scientific management, 19th
century
time management exercises used in industrial settings
Henry Ford : assembly line pacing used to improve output
and business profitability
BA became widespread when computers were used in
decision support systems (DSS) in the 1960s
Evolved into ERP, data warehouses, etc.
Types of Business Analytics
Reporting or Descriptive Analytics
Affinity grouping
Clustering
Modeling or Predictive analytics
BA: Reporting
Based on the need to locate and distribute business
insights and experiences
Often involves ETL procedures used alongside a data
warehousing scheme
The data is then collected, quantified, and organized
using reporting tools
Reporting allows information describing different
views of an enterprise to come together in one place
A user could query a production and marketing database to
determine if production of a product could be moved closer
to where a product is sold
BA: Affinity grouping
A tool used by businesses and
organizations to take ideas
and data and organize them.
Often takes the form of an affinity diagram
Enables data and ideas stemming from
brainstorming to be sorted into groups
Sorting is based on their natural relationships
BA: Clustering
Placing a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in
some sense or another) to each other than to those in
other clusters – Wikipedia
Is a main task of exploratory data mining and statistical
data analysis
Clustering is a general task that does not have one set
solution
Clustering can be hard or fuzzy
Can be done by people or machines
The latter is preferred
BA: how do we model clusters?
Connectivity models – how data can be connected to
other points
Density models – defining a cluster by determining where
sets of data points are densest
Distribution model – clusters are modeled using statistical
distributions
Expectation maximization
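A distribution model fitted with expectation maximization can be sketched as below. This is a minimal illustration, not a production method: it assumes one-dimensional data, exactly two normal distributions, and a fixed number of iterations. The names `gaussian_pdf` and `em_two_gaussians` are hypothetical helpers introduced here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by alternating E and M steps."""
    # Arbitrary starting guesses (assumption): extremes of the data, unit variance.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how likely is each point to belong to each cluster?
        resp = []
        for x in data:
            dens = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each cluster from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(min_var, var)  # floor avoids a zero variance
            weights[k] = nk / len(data)
    return means, variances, weights
```

The per-point values in `resp` are soft memberships, so the same sketch also illustrates fuzzy rather than hard clustering.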
BA: Predictive Analysis
Stems from the desire to predict future events through
analyzing data an enterprise has collected
Pattern exploitation results in the identification of
opportunities and also risks
Allows relationships in disparate data to be identified
Helps guide decision making in a business
Is often implemented in the form of data mining
BA : Examples
Credit company – uses business analytics to track the credit risk of
customers and to match customers to offerings
Sales and offers – companies can track customer interaction,
and use that information to determine appropriate product
offerings.
Sales groups can use BA to optimize inventory and analyze
past sales
Could measure peak purchasing times for products
Could decide whether or not to stock poorly selling items
Give examples of business cases where data mining might be
useful, and describe how data mining would be used
Preventing credit card fraud through detecting spending patterns
Inventory management by tracking sales
BA : Challenges
Acquiring sufficient volumes of high quality data
Most data acquired in the field is unsorted and appears in
many different formats
When dealing with high volume data, deciding what is
important and what is noise
Rapidly reacting storage structures
BA can influence customer interactions, so the relevant
information must be available quickly
Ex: a customized sales pitch
Business Analytics & Data Mining
Data Mining is an important subtask of Business
Analytics
Both Predictive analysis and clustering tasks
utilize information retrieved from data mining
Data mining helps handle some of the specific
problems faced when conducting Business
Analytics
Dealing with and sorting through large data sets
Data Mining : An Overview
What is Data Mining ?
History
Applications of Data Mining
Detecting data discrepancies or outliers
Relationship identification
Data-Function mapping for modeling/prediction
Categorizing and Summarizing Data
Standards
Challenges
Data Mining : What is it?
Applying statistical analysis techniques to data
the goal often being to determine unnoticed patterns or to
collect categorized information
turns collected data into understandable structures
Data Mining is often used as a buzzword to describe
processing large amounts of data
In essence, its correct use relates to discovery of new
things through observation
Synonymous with knowledge discovery
Data Mining : History
Though HNC trademarked the term in 1990, hands-on
pattern extraction is centuries old
It has existed as long as statistical analysis
Advances in computer science have increasingly
shifted the field from hands-on to machine-dependent work.
This allows for :
The use of data indexing and DB systems to handle data
efficiently
The application of statistical algorithms on a large scale,
possibly in a distributed manner, with less error
Data Mining : Use : Application
Data Mining is often broken into several different
categories of tasks
Detecting data discrepancies or outliers
Relationship identification
Data-Function mapping for modeling/prediction
Categorizing and Summarizing Data
Data Mining : Finding outliers
The process of analyzing large, mostly
homogeneous, sets of data and determining
which sets or points
“go with the flow” and conform with patterns the rest
of the data seem to follow
do not follow expected results when viewed against
the entire set of data
An outlier can be a point or set of points, but can
also be defined through other means
A period of time could yield unexpected results
Ex. Network Intrusion
Data Mining : Techniques in finding outliers
Rule Based – deciding a set of rules that
determine an outlier (or what isn’t one)
Can be fuzzy or hard rules
Cluster Analysis – As mentioned earlier
Distance or Standard Deviation – Determining an
average over a data set and marking points that
fall outside a chosen deviation or distance from it
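The standard deviation technique can be sketched as follows. This is a minimal example that assumes a single numeric attribute and a cutoff of two standard deviations. Both the cutoff and the name `find_outliers` are illustrative choices, not part of the original material.

```python
from statistics import mean, pstdev


def find_outliers(values, max_deviations=2.0):
    """Return the values lying more than max_deviations standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing can deviate
    return [v for v in values if abs(v - center) > max_deviations * spread]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_outliers(readings))  # [50]
```

A single extreme value also inflates the mean and the deviation, so a robust variant might use the median instead.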
Applications of Outlier Detection
Network Intrusion Detection
Unusual bursts of network activity
Identity Theft Detection
Unusual spending or customer activity
Detecting Software bugs
Software does not deliver expected outputs
Sensor event detection
Monitoring patient health fluctuations in a medical setting
Preprocessing
Removing data skews based on extenuating
circumstances
Relationship Discovery: Basics
Understanding how data is related is a key factor
in trend and knowledge discovery
This is at the core of data mining
Ex: Which products are often bought before a major
forecasted storm
{hamburger buns} => {???}
With small sets of data, or with correlations that
aren’t subtle (as the one above), identifying
relationships is not as difficult
With large data sets or subtle relations a
combination of rule generation and data analysis
can be used to expedite the process
Relationship Discovery: How it's done
Since the number of relationships between points
of data could be boundless, two important
concepts are often introduced in relationship
discovery:
The amount of data within which a relationship
might exist, called the support of a rule.
The probability that data in the support will verify a
selected rule, called the confidence of a rule.
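Support and confidence can be computed directly from a list of orders. The sketch below assumes the usual association rule convention: support is the fraction of orders containing all items of a rule, and confidence is the fraction of orders containing the left side that also contain the right side. The order data is made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    matching = sum(1 for t in transactions if items <= set(t))
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Hypothetical orders
orders = [
    {"hamburgers", "hamburger buns", "ketchup"},
    {"hamburgers", "hamburger buns"},
    {"hamburger buns", "hot dogs"},
    {"hamburgers", "soda"},
    {"milk", "bread"},
]
print(support(orders, {"hamburgers", "hamburger buns"}))  # 0.4
```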
Relationship Discovery: How it's done
Generally we apply minimum bounds to both the support of
a rule and its confidence to determine relationships
First : determine possible relationships
Set a minimum support
Orders with hamburgers, Orders with hamburger buns
Other, user specific rules can be used here
Second : take the remaining sets, look for patterns in the
items sets such that occurrence rate is above the minimum
confidence
How many people bought hamburgers and buns together
Ex: we find that if a customer is male and buys
diapers, he is likely to also buy beer
{male, diapers} => {beer}
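The two steps above can be sketched for rules between pairs of items. This is a simplified illustration that only considers single-item rules such as {diapers} => {beer}. It is not a full algorithm such as Apriori, and `mine_pair_rules` and the basket data are hypothetical.

```python
from itertools import combinations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for item pairs passing both minimum bounds."""
    n = len(transactions)
    item_counts = {}
    pair_counts = {}
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_counts.items():
        # First: drop pairs that fall below the minimum support.
        if count / n < min_support:
            continue
        # Second: keep each direction of the rule that meets the minimum confidence.
        for lhs, rhs in ((a, b), (b, a)):
            conf = count / item_counts[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, count / n, conf))
    return rules


baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["diapers", "beer", "milk"],
    ["milk", "bread"],
    ["beer", "chips"],
]
print(mine_pair_rules(baskets, min_support=0.5, min_confidence=0.9))
# [('diapers', 'beer', 0.6, 1.0)]
```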
Matching data to functions
Often, it is desirable to match data sets and the
factors that determine them to functions
Allows for the possibility of predicting future results
Involves learning how dependent and
independent variables in our data interact
Dependent : the result, or where a point exists
Independent : a cause or circumstance that
determines the dependent variable
If we know how dependent and independent
variables interact, we can create a function and
run simulations to see results
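As a small illustration of matching data to a function, the sketch below fits a straight line to one independent and one dependent variable using ordinary least squares. The linear form is an assumption made for simplicity, since real relationships are often more complex. `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted function to a new value of the independent variable."""
    slope, intercept = model
    return slope * x + intercept


# Made-up observations that follow y = 2x + 1 exactly
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(model, 5))  # 11.0
```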
Uses of Function-Data Mapping
Weather Forecasting
Determining what conditions lead to what kinds of
weather
Stock market analysis
When to buy and when to sell
Crime Prevention
What conditions cause or prevent crime
Categorizing
Categorizing – Often we want to separate data
based on a set of predefined attributes
Very helpful in pattern recognition
Ex: a person's political preference
The process :
we synthetically generate or measure a set of
observations (data points) with known categories
we extract properties from said observations which
we believe contribute to the category
These are called explanatory variables
Finally we examine new data for these properties
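The process above can be sketched with a nearest-neighbor rule: a new observation receives the category of the most similar known observation. This is one possible technique chosen for brevity, not one the material prescribes. The two numeric explanatory variables and the labels "A" and "B" are made up.

```python
import math


def classify(labeled, new_point):
    """Return the category of the labeled observation closest to new_point."""
    nearest = min(labeled, key=lambda pair: math.dist(pair[0], new_point))
    return nearest[1]


# Observations with known categories: (explanatory variables, category)
known = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]
print(classify(known, (1.1, 0.9)))  # A
```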
Summarizing
Summarizing – we almost never want to look at all of
the data individually
Having too much data can actually hinder the decision
making process
Known as information overload
Summarizing takes the results from data mining and
transforms it into formats that can be easily read
without omitting important information
Summarizing might :
Extract and display only important data
correlate and abstract data to display trends
Formats Include : Reports, Graphs, Dashboards, etc.
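A simple summary might total individual records per product and show only the largest totals. The sketch below assumes records of the form (product, amount). The function name `summarize_sales` and the data are illustrative.

```python
from collections import defaultdict


def summarize_sales(records, top_n=2):
    """Total the amounts per product and return the top_n products by total."""
    totals = defaultdict(float)
    for product, amount in records:
        totals[product] += amount
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


sales = [("buns", 3.0), ("beer", 10.0), ("buns", 4.0), ("milk", 2.0), ("beer", 5.0)]
print(summarize_sales(sales))  # [('beer', 15.0), ('buns', 7.0)]
```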
Standards : CRISP-DM
Cross Industry Standard Process for Data Mining
describes common practice for conducting data mining in an
enterprise setting
KDnuggets, a community resource for data mining and analytics,
ran polls that found CRISP-DM was the top methodology
in 2002, 2004, and 2007
Six step methodology
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
CRISP-DM : Explained
Business Understanding
Determining the business purpose
Define success conditions – how do we know we succeeded
Ex : improved prediction accuracy
Map purpose/success conditions to data mining results
Ex: fraud prevention => detect deviations
Data Understanding
Collecting and exploring data – defining its attributes
Data quality verification
CRISP-DM : Explained
Data Preparation
Data Cleaning
Normalization – fitting data within ranges
Outlier removal – removing cases that could skew the model
Handle missing attributes – the data was not obtained
Formatting – changing data so that it fits with our tools
Modeling – fitting the data to a model following the
methods previously described and then interpreting that
model
Assess the accuracy of the model
General purpose divided into prediction or description
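Two of the data preparation steps listed on this slide, handling missing attributes and normalization, can be sketched as follows. Filling gaps with the mean and scaling to the range 0 to 1 are common choices assumed here for illustration. Other strategies, such as dropping incomplete rows, are equally valid.

```python
def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


cleaned = fill_missing([2.0, None, 4.0])
print(min_max_normalize(cleaned))  # [0.0, 0.5, 1.0]
```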
CRISP-DM : Explained
Evaluation – look at results and measure them with respect
to the success cases defined earlier
Determine if one has succeeded
Determine next steps, how do we apply the results
Deployment – The execution of a strategy for using the
results of our data mining
Includes preparing ways to monitor and maintain the
application of data mining results in the day to day
Includes some sort of final summary
SEMMA
Sample, Explore, Modify, Model and Assess
Proposed by SAS Institute : A producer of BI and BA
software suites.
Though this model is often considered general, SAS
prefers to apply it directly to its products
Focuses mainly on data mining and not on applying results
to business (unlike CRISP-DM)
Sample – Select the data set
Explore – Understand data through discovering relationships, both expected and otherwise
Modify – Transform and clean the data in order to prepare it for the modeling process
Model – Apply models to the data in order to discover trends and make predictions
Assess – Evaluate the results of the modeling process to determine the reliability of the mined data
Challenges in data mining
Not enough or too much data
Small enterprises often find it difficult to access
sufficient quantities of data
If the enterprise is large however, sometimes there is too
much and deciding what to keep is difficult
Acquiring clean data
Multiple formats or no format at all
Privacy and ethical concerns
Data aggregation : data compiled from multiple sources can
lead to revelations that violate privacy concerns
Ex: anonymous data is collected and aggregated, leading to
identification
Editor's Notes
Taylor : mechanical engineer who focused on improving industrial efficiency
DSS – Decision Support Systems, ERP – Enterprise Resource Planning
Fuzzy clustering – each object has a likeliness of belonging to a cluster
Expectation maximization (with multivariate normal distributions): one can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and keep alternating between the two until the resulting values both converge to fixed points
Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93. pp. 207. doi:10.1145/170035.170072. ISBN 0897915925.
http://en.wikipedia.org/wiki/Association_rule_learning#Useful_Concepts