Webinar - The Science of Segmentation: What Questions You Should be Asking Your Data?

© 2015 Pivotal Software, Inc. All rights reserved.
The Science of Segmentation:
What Questions Should You Be Asking Your Data?
April 14, 2015
Jarrod Vawdrey, Data Scientist @ Pivotal
Grace Gee, Data Scientist @ Pivotal

Agenda
• Typical State Of Companies New To Big Data Analytics
– Benefits of Big Data technologies
• When to Use Segmentation
– Common business problems
– Types of available data
• Use Cases & Approaches To Segmentation
– Common approaches
– Best practices

Typical State of Companies
New to Big Data Analytics

Typical State of Companies New to Analytics
• Companies in the process of transforming into a data-driven organization often ask similar questions about where to start:
– How do I make data available for my analysts?
– What tools are needed to efficiently process and build models on my big data sets?
– What data should I be collecting and archiving?
– Where and how can I start to use all my data to quickly gain actionable insights and begin integrating data science into our organization’s practices?
– How do I leverage data to generate value for stakeholders?
– How do I enable analysts and data scientists to be more effective?

Common Business Challenges
Data Availability
• Disparate data sources
• No integration of data across lines of businesses
• Insufficient data
• Unknown single source of truth
Slow Time-to-Insight
• Outdated analytics architectures focused on operational processes often hamper the experimental nature of big data analytics
• Lack of knowledge about analytics software for in-place processing of, and computation on, Big Data
• Company organizational structure inhibits fast acquisition of data and communication of insights

Big Data Technologies for Data-Driven
Organizations
• Data Lake: efficient, massively scalable Big Data storage platform
– Store all data: we don’t want to inhibit the ability to answer future
questions
– Save all types of data (structured, unstructured, and semi-structured): we may not immediately know the “optimal” form in which to store data for analysis
– Work with multiple types of data from one location
– Centralized location of data accessible to all organizations
• Agile Analytics Platform: purpose-built architecture for getting
results and gaining insights quickly through parallel, in-place data
analytics
– No required sampling due to limited memory
– No data movement
– Scalable analytics

Big Data Technologies for Data-Driven Organizations
[Diagram: Platform-Driven Data Science]
• Data sources: proprietary structured data; proprietary unstructured data; partner data; self reporting (Google, weblogs, Twitter); external sources (Census, Nielsen, weather, etc.); sensors
• Platform components: Pivotal HD (HDFS), HAWQ, Greenplum DB, GemFire XD, MADlib, PL/R, PL/Python, etc.
• Applications: enterprise apps, reporting, prioritized operational processes, inventory optimization, demand forecasting, fraud detection

Segmentation:
An important step for understanding data
• What is segmentation?
– Automatic grouping of entities based on a common set of
features
– Identification of patterns amongst similar entities
• What is segmentation good for?
– Identifying select features that greatly differentiate groups
of entities
• E.g. Identifying behaviors of high-profit suppliers and low-profit
suppliers
– Identifying similar characteristics amongst different groups
• E.g. Identifying similar market segments to target
– Predicting characteristics and behaviors of new or
unknown entities
• E.g. Inferring missing labels, predicting market response to new
products

Segmentation & Big Data Technologies
• Segmentation problems often deal with:
– Multiple data sources from multiple lines of business and external sources
– BIG DATA, particularly from sensor data or transactions/point of sales
– High-dimensional feature sets
• Big Data technologies make segmentation problems feasible and bring faster time-to-insight through:
– Ability to leverage and integrate all relevant data sources, no matter how large
→ Data Lake
– Using ALL data to train segmentation models rather than relying on samples or a subset of data that fits into memory
→ MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.
– Quickly building segmentation models and scoring new entities through parallelized, in-place computation
→ MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.
How cutting-edge Big Data technology enables faster insights

When To Use Segmentation

Common Business Problems
where segmentation can help
• Customer Micro-targeting
– Identifying market segments and their purchasing behaviors
• Operations & Logistics
– Identifying behaviors of underperforming or outperforming stores, suppliers, delivery services, etc.
• Fraud
– Identifying normal and anomalous user behaviors within networks
• Domain Resolution
– Inferring labels or groups of similar web domains

Data Used In Segmentation of Customers
Power in leveraging both internal and external datasets
• Demographic profiles
• Sensor data
• Product metadata
• Shipment data
• Store metadata
• Transactions and invoices
• Delivery information
• Marketing plans
• External data: Census, Nielsen, social networks, etc.

Gaining Additional Value From External Data
Often companies do not or cannot collect
sufficient data about their customers to
construct a complete profile. Augmenting
internal data with external sources allows
companies to:
• Develop a 360-degree customer view
• Gain insights into how consumers are
interacting with competitors
• Improve accuracy of predictive models
• Increase the value of internal data
[Figure: internal and external data sources]
Point of sales, transaction data, web/apps logs, investments, market basket, loans, traffic, weather, IXI wealth complete, Haver time series, Dept. of Labor, CRM
Note: This list only represents a subset of data sources that should be considered.

Example: Using Census Data to Build
Family Profiles
Consumer Packaged Goods (CPG) companies are
often interested in building market profiles for
micro-targeting to improve marketing strategies and
supply chain planning.
Hypothesis:
• CPG companies are interested not only in the individual consumer but also in the family profile
– E.g. Consumption of child products is affected by family size
Approach:
• Census Public Use Microdata Sample (PUMS)
files include person records and housing
records which can be combined in
segmentation models to build rich family
profiles.
[Figure: Households with Children*; axis: fraction of households]
*Children as defined by a certain age group
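The approach above, combining person records with housing records, can be sketched roughly as follows. This is a minimal illustration on made-up toy records, not the presenters' implementation. The column names `SERIALNO` (household identifier) and `AGEP` (age) are assumptions about the PUMS layout, the `state` field is hypothetical, and the child age cutoff is an assumption because the slide leaves the age group unspecified.

```python
from collections import defaultdict

# Hypothetical toy records; real PUMS files carry many more columns.
# SERIALNO (household id) and AGEP (age) are assumed column names.
housing_records = [
    {"SERIALNO": "H1", "state": "S1"},
    {"SERIALNO": "H2", "state": "S1"},
    {"SERIALNO": "H3", "state": "S2"},
]
person_records = [
    {"SERIALNO": "H1", "AGEP": 38},
    {"SERIALNO": "H1", "AGEP": 36},
    {"SERIALNO": "H1", "AGEP": 7},
    {"SERIALNO": "H2", "AGEP": 64},
    {"SERIALNO": "H3", "AGEP": 29},
    {"SERIALNO": "H3", "AGEP": 3},
    {"SERIALNO": "H3", "AGEP": 1},
]

CHILD_MAX_AGE = 17  # assumption: the source does not state the age group


def build_family_profiles(housing, persons, child_max_age=CHILD_MAX_AGE):
    """Join person records to housing records and summarize each household."""
    ages_by_household = defaultdict(list)
    for person in persons:
        ages_by_household[person["SERIALNO"]].append(person["AGEP"])

    profiles = []
    for house in housing:
        ages = ages_by_household.get(house["SERIALNO"], [])
        n_children = sum(1 for age in ages if age <= child_max_age)
        profiles.append({
            "SERIALNO": house["SERIALNO"],
            "state": house["state"],
            "family_size": len(ages),
            "n_children": n_children,
            "has_children": n_children > 0,
        })
    return profiles


profiles = build_family_profiles(housing_records, person_records)
fraction_with_children = sum(p["has_children"] for p in profiles) / len(profiles)
print(profiles)
print(f"fraction of households with children: {fraction_with_children:.2f}")
```

The resulting per-household fields (family size, number of children) are the kind of features that could feed a segmentation model alongside purchase data.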

Use Cases &
Approaches to Segmentation

Common Approaches for Implementing Segmentation
Data Step
• Identify join relationships across all data sources
• Aggregate data to common granularity
Feature Step
• Identify and create features that can characterize the entities you want to segment, e.g. age, gender, types of last transactions, average time between visits, average spend, sensitivity to price change, etc.
Model Step
• Candidate algorithms: clustering strategies like k-means & hierarchical clustering, regression or hierarchical modeling and grouping by similar coefficients, ensemble methods, etc.
Analysis of Results
• Look at average features across clusters
• Look at average cluster features vs. population average (e.g. to find anomalous behavior)
• Identify common features amongst segments (e.g. opportunities for cross-sell/up-sell)
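As an illustration of the feature step, the sketch below derives two of the features named above, average spend and average time between visits, from a transaction log. The log layout `(customer_id, visit_date, amount_spent)` and the sample rows are hypothetical. At the data volumes discussed in this deck, the same aggregation would more likely run in-database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, visit_date, amount_spent)
transactions = [
    ("c1", date(2015, 1, 1), 20.0),
    ("c1", date(2015, 1, 11), 40.0),
    ("c1", date(2015, 1, 31), 30.0),
    ("c2", date(2015, 2, 1), 100.0),
]


def customer_features(rows):
    """Aggregate a transaction log to one feature record per customer."""
    by_customer = defaultdict(list)
    for customer_id, visit_date, amount in rows:
        by_customer[customer_id].append((visit_date, amount))

    features = {}
    for customer_id, visits in by_customer.items():
        visits.sort()  # chronological order
        dates = [d for d, _ in visits]
        amounts = [a for _, a in visits]
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        features[customer_id] = {
            "n_visits": len(visits),
            "avg_spend": sum(amounts) / len(amounts),
            # None when there is a single visit and no gap can be measured
            "avg_days_between_visits": sum(gaps) / len(gaps) if gaps else None,
        }
    return features


features = customer_features(transactions)
for customer_id, record in features.items():
    print(customer_id, record)
```

Each record becomes one row of the feature table, at the same granularity (one row per customer) that the data step aggregates to.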

Example: Profiling market and consumer segments
• Objective:
– Identify characteristics of consumers that prefer certain brands or products
• Common business challenges:
– No integration of data amongst different lines of businesses
– Internal data is not sufficient for building profiles
– No information about which consumers are more/less profitable
• Data sources:
– Point of sales, demographic data, loyalty data, product and store metadata, external data

Step 1: Consolidate Data Sources
• Identify relationships and joins amongst all data sources
• Clean data by removing outliers and imputing missing values if appropriate
– For example, using the median or weighted average value for a state to impute a missing value for a county
• Aggregate or select data to a common granularity that makes sense
– For example, demographic profiles can be built at the zip code or county level, and store profiles can be built at the individual store, tier, or region level
• Do gap analysis to determine the scope of data sufficient for analysis
– For example, a certain subset of customers may be missing data for a large time period and should be scoped out
[Figure: number of stores reporting over time]
Using an MPP database like Greenplum, we can join tables with billions of rows in a little over a minute.
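The state-level imputation mentioned in this step can be sketched as below, under the assumption that a missing county value is filled with the median of the observed counties in the same state. The table layout, the column name `median_income`, and all values are hypothetical.

```python
from statistics import median

# Hypothetical county-level table; None marks a missing value.
counties = [
    {"state": "S1", "county": "county_a", "median_income": 58000},
    {"state": "S1", "county": "county_b", "median_income": 64000},
    {"state": "S1", "county": "county_c", "median_income": None},
    {"state": "S2", "county": "county_d", "median_income": 55000},
    {"state": "S2", "county": "county_e", "median_income": None},
]


def impute_with_state_median(rows, column):
    """Return copies of rows with missing values replaced by the state median."""
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row["state"], []).append(row[column])
    state_median = {state: median(values) for state, values in observed.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[column] is None:
            # Stays None if the whole state is missing; such rows may need
            # to be scoped out during gap analysis instead.
            new_row[column] = state_median.get(new_row["state"])
        imputed.append(new_row)
    return imputed


imputed = impute_with_state_median(counties, "median_income")
for row in imputed:
    print(row)
```

Whether a median, a weighted average, or no imputation at all is appropriate depends on the data; the function only shows the mechanics.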

Step 2: Feature Engineering & Selection
• Transactions & Point of Sales: total sales, change in sales, price, discount, market basket
• Store/Location: geolocation, weather
• Product: department/type, color, size, brand, package
• Demographic: age, gender, income, employment, education, family size, marital status, citizenship, language
• Loyalty: status, length of membership, activity
It’s common for data scientists to generate hundreds of thousands of features.

Step 2: Feature Engineering & Selection
In order to reduce feature dimensionality and account for unwanted bias due to the inclusion of highly correlated features, we can filter features using approaches such as:
• Principal Component Analysis
– Reduce the dimensionality of the feature space to a select number of principal components
• Iterative pairwise correlation comparison
– Calculate NxN pairwise correlations, where N is the number of features
– Remove the feature that appears in the greatest number of correlated pairs (correlation coefficient greater than some threshold)
– Iterate until no correlated pairs exist
[Figure: subset of the feature correlation matrix. The large number of features requires an automated approach to feature selection.]
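The iterative pairwise correlation comparison could look roughly like the sketch below. It assumes Pearson correlation, compares the absolute value against the threshold so that strong negative correlations also count, and breaks ties by taking the first feature in input order. None of these choices is specified on the slide, and the feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: treated as uncorrelated here
    return cov / sqrt(var_x * var_y)


def filter_correlated(features, threshold=0.9):
    """features maps name -> list of values; returns the names that are kept."""
    kept = list(features)
    # Pairwise correlations are computed once and reused in every iteration.
    corr = {
        frozenset((a, b)): abs(pearson(features[a], features[b]))
        for a, b in combinations(kept, 2)
    }
    while kept:
        counts = {name: 0 for name in kept}
        for a, b in combinations(kept, 2):
            if corr[frozenset((a, b))] > threshold:
                counts[a] += 1
                counts[b] += 1
        worst = max(kept, key=lambda name: counts[name])
        if counts[worst] == 0:
            break  # no correlated pairs remain
        kept.remove(worst)  # drop the feature in the most correlated pairs
    return kept


# Hypothetical features: three sales measures that move together, one that does not.
features = {
    "total_sales": [10, 20, 30, 40, 50],
    "units_sold": [1, 2, 3, 4, 5.5],
    "sales_with_tax": [11, 22, 33, 44, 55],
    "discount": [5, 1, 4, 2, 3],
}
kept = filter_correlated(features, threshold=0.9)
print(kept)
```

With hundreds of thousands of features the NxN correlation matrix itself becomes the expensive part, which is where a parallel, in-database computation would replace the pure-Python loop.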

Step 3: Build Models
• Example: K-means Clustering
1. Create a single feature vector for each entity, e.g. consumer
2. Use k-means clustering to identify k consumer segments
i. Run multiple training trials for multiple values of k
ii. Use any one of a variety of techniques for selecting the optimal k, e.g. silhouette coefficient
3. Look at average features across segments to identify segment characteristics
4. Look at purchasing behaviors of each segment to identify segment preferences
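A small, self-contained sketch of the k-means procedure described in this step follows: several random-start trials for each candidate k, with the mean silhouette coefficient used to compare values of k. It is a plain-Python illustration on hypothetical 2-D feature vectors, not the in-database implementation (for example MADlib) that the deck has in mind. The trial count, the range of k, and the random initialization are all assumptions.

```python
import random
from math import dist  # available in Python 3.8+


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm with random initial centroids drawn from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(k):
            cluster = [p for p, label in zip(points, labels) if label == c]
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated as worst case here
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D feature vectors, one per consumer.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

scores = {}
for k in range(2, 6):
    trials = [kmeans(points, k, seed=seed)[0] for seed in range(5)]
    scores[k] = max(silhouette(points, labels) for labels in trials)
best_k = max(scores, key=scores.get)
print(scores)
print("selected k:", best_k)
```

Keeping the best trial per k guards against a poor random start, since k-means only converges to a local optimum.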

Step 4: Extract Business Value from Results
• Segmentation models used to identify and profile consumer groups
– Calculate descriptive statistics for each segment and compare to uncover previously hidden opportunities
• Cross-sell/up-sell opportunities
• Potential data issues or supply chain execution opportunities regarding unequal proportion of product shipment or inventory to regional preference
• Rich set of reusable data assets made available for ongoing analysis & reporting
[Figure: Feature 1, Feature 2, Feature 3, ... compared across Cluster 1 to Cluster 4, on a scale from low value to high value]
What Questions Should You Be Asking Your
Data?
• Are you collecting the right data & storing it in the right
fashion?
• Do you have the right technology to support your data
and data science endeavors?
• Where are the gaps in your data? How can external
sources fill those gaps?
• How can your data sources be joined or aggregated
together to build rich feature sets?
• How can you extract business value from your data?
Segmentation will help you answer all of these questions!

Thank You.
