4. About Me
• Degree in Applied Mathematics
• Over 20 years with Oracle software
• Over 10 years with data warehouses
• Big Data Analyst
• Author of numerous Oracle books
• Blogger: http://ians-oracle.blogspot.com/
• Oracle ACE
• IOUG Past-President
• TOUG Board Member
• Toronto based
• Twitter: @iabramson
6. Why Big data?
• New data sources
• Unprecedented volume
• Real World Issues
– Data systems are reaching capacity, requiring high-cost alternatives
– Archive data is too far offline
– Organizations require cost-effective options
– Retain all data for future analysis
7. Big Data Defined
• Roger Magoulas (O’Reilly Research): “Data becomes ‘Big Data’ when the size of the data becomes a part of the problem.”
• Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
• IDC: Big Data is a term/concept which is used as a generic name for a “generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”.
• Wikipedia: “Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.”
8. The Attributes of Big Data
• Classic Data Attributes:
– Volume
– Velocity
– Variety
• Big Data Technical Attributes
– massive, parallel computing environment
– infinitely scalable computing clusters, including cloud
• Three main technical requirements
– A storage medium that accommodates large volumes, both for storage and for data streaming
– Computing horsepower and an architecture that process the data where it resides, rather than extracting it first and processing it elsewhere
– A programming model built on a computational paradigm that runs computations in a highly parallel and scalable environment
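The third requirement is the role Map/Reduce plays in Hadoop. A minimal single-machine sketch of the idea, assuming a word-count task; the function names and sample partitions are illustrative and not from the slides, and a thread pool merely stands in for the nodes of a cluster:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words inside one partition, where the data lives."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(partitions, workers=4):
    """Run the map step on every partition in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())


# Hypothetical partitions; on a real cluster each would sit on its own node.
partitions = [
    ["big data is big", "data moves fast"],
    ["data comes in variety"],
]
totals = word_count(partitions)
```

Only the small per-partition counts travel to the reduce step; the raw lines never leave their partition, which is the point of processing data where it exists.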
9. Challenges for Big Data
http://tdwi.org/blogs/fern-halper/2013/10/four-big-data-challenges.aspx
Confidential
10. Big Data and Data Warehouse – war or peaceful coexistence?
• The problem: different uses call for different schemas and different partitioning. In most cases the requirements are orthogonal, so no single data partitioning or indexing scheme can be optimal for everybody.
• The ideal goal: acquire and store the data “as is” and access it using multiple models. This needs a powerful artificial intelligence knowledge base and data access code generators.
• It will never be optimal for everybody without huge redundancy.
• The problems are less painful if most of the data is read anyway. Good for analytics, not good for OLTP.
• Eventually Big Data platforms will become DW platforms with well-developed access interfaces.
• Until then: acquire and store, then distribute on demand to conventional DWs and data marts.
11. The New Data Architecture
[Architecture diagram; labels in extraction order: Data Archive, Operational Systems, Enterprise Data, Social & Clickstream, Sensor Generated, Big Data, ODS, Hadoop, Public Data, HDFS, Map/Reduce, Historical Data, Data Warehouse/BI/Analytics, Other New Sources]
21. Big Data vs. BI presentation viewpoint
IMPACT
22. Questions for BI and Big Data
• Sample questions for BI
– What is my sales volume by time, by region, by store, by season?
– What is the average review rating by product category, by product?
– What are the dynamics of reviews, and what are the trends?
• Sample questions for Big Data/Data Science
– How does a change in review ratings impact sales?
– What is the time lag between a review rating change and a sales volume change?
– What products are purchased together, and can I improve product recommendations?
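One common way to approach the time-lag question is lagged correlation: shift one series against the other and see which shift lines them up best. A minimal sketch, assuming equally spaced periods; the series are made up for illustration, and the helper names `pearson` and `best_lag` are hypothetical:

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lag(ratings, sales, max_lag):
    """Return the lag (in periods) at which sales best track earlier ratings."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair ratings[t] with sales[t + lag].
        xs = ratings[: len(ratings) - lag]
        ys = sales[lag:]
        scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Made-up weekly series: sales follow ratings two weeks later.
ratings = [3.0, 3.2, 4.1, 4.5, 3.8, 3.1, 2.9, 3.6, 4.4, 4.0, 3.3, 3.0]
sales = [100.0, 100.0] + [r * 40 for r in ratings[:-2]]
lag, scores = best_lag(ratings, sales, max_lag=4)
```

A high correlation at some lag is a lead for a hypothesis, not proof that ratings drive sales; seasonality or promotions can produce the same pattern.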
23. DATA SCIENCE
Data Science Skills: Science
• Purpose: State the problem
• Research: Discover information about the topic
• Hypothesis: Predict the outcome
• Experiment: Develop a process to test the hypothesis
• Analysis: Record the results
• Conclusion: Compare hypothesis and results
24. Data Science Team
Each team would include:
• Data Science Analyst – excellent communication skills, science and analytical background.
• Data Science Researcher/Solution Architect – good communication, good statistical/math skills, working knowledge of 2 of the following data science libraries: Mahout or any other machine learning library, RHadoop, R, SAS, SPSS.
• Data Science Technologist – acceptable communication skills, 25% deployable to the client site (at minimum a few should be deployable, others can be offshore), good developer, working knowledge of Big Data and related technologies.
• Data Science Presentation Engineer – knowledge of BI and presentation tools.
Nordstrom’s Big Data Team Mission: “Delighting Customers through data-driven products”
27. Top 10 Use Cases (2013 Computerworld)
1. Modeling Risk
2. Customer Churn Analysis
3. Recommendation Engines
4. Ad Targeting
5. POS Transaction Analysis
6. Analysis of network data to predict future failures
7. Threat Analysis
8. Trade Surveillance
9. Search Quality
10. Data Sandbox
http://www.computerworld.com.sg/resource/storage/iiis-2013-technical-workshops/?page=2
28. The Big Data of Dating
• From analysis of match.com dating patterns:
• 21+ million members
• 100+ million hits per month
– January 2nd is the busiest day for people to sign up on dating sites
– Women get 60% more attention if their photo is taken indoors
– Men get 19% more attention if theirs is taken outside
– Full-body photos boost both sexes' success by 203%
– Posing with animals or your best friends might seem cute, but it actually reduces your popularity by 53 per cent (men) and 42 per cent (women)
– Men get 8% fewer messages if they put up selfies
– Mentions of words like divorce and separated get men 52 per cent more messages
– Women who are more forward, using phrases like dinner, drinks or lunch in the first message, get 73 per cent more replies, while men should play it cooler: those who mention the same words in their opening message get 35 per cent fewer replies
30. Use Case Checklist
• Title - An active description which identifies the goals of the primary actor
• Characteristics:
– Primary actor
– Goal in Context
– Scope
– Level
– Stakeholders and Interests
– Precondition
• Success criteria
– Precondition
– Minimal Guarantees
– Success Guarantees
– Trigger
– Main Success Scenario
– Extensions
• Technology & Data Variations List
• Related Information
Reference: Alistair Cockburn
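The checklist can be kept as a simple record so that incomplete use cases are easy to spot. A minimal sketch, assuming one field per checklist item; the class name, field types, `missing` helper and sample values are illustrative choices, not part of the referenced template:

```python
from dataclasses import dataclass, field, fields


@dataclass
class UseCase:
    """One record per use case, mirroring the checklist fields."""
    title: str
    primary_actor: str = ""
    goal_in_context: str = ""
    scope: str = ""
    level: str = ""
    stakeholders_and_interests: list = field(default_factory=list)
    precondition: str = ""
    minimal_guarantees: str = ""
    success_guarantees: str = ""
    trigger: str = ""
    main_success_scenario: list = field(default_factory=list)
    extensions: list = field(default_factory=list)
    technology_and_data_variations: list = field(default_factory=list)
    related_information: str = ""

    def missing(self):
        """Names of checklist fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft with only three fields filled in.
draft = UseCase(
    title="Analyst identifies products purchased together",
    primary_actor="Data Science Analyst",
    trigger="Weekly sales data lands in the cluster",
)
todo = draft.missing()
```

Here `todo` lists the items still to be written before the use case is ready for review.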
31. EXPEDIA CASE STUDY
Archive Use Case
• 1.5 petabytes of continuously ingested data
• One of the largest Hadoop clusters in the world
• 80% open source EDW
Customer Benefits
• Avoided massive cost of new DW infrastructure
• Able to keep and analyze historical transactions
• Reduced risk of DW replacement
• Able to scale on demand using low-cost servers
Transaction Volume
• > 500 GB daily increase from all sources: transaction, social, contact center
[Diagram labels: Staging and Historical Analysis; Call Center and Online data; Informatica transformation & aggregation; Analytic Infrastructure]
32. Use Case: Sales Analysis
Sales per sq. ft.: changes over time
• Fitting a no-intercept line to the scatter of sales over sales floor area gives a visual baseline Sales-per-Sq.Ft. (SpSF) for each year
• Mathematically, the SpSF measure is given by the slope coefficient of the trend: 392.51 [CAD/Sq.Ft.] in 2011 vs. 373.76 [CAD/Sq.Ft.] in 2012
[Chart labels: 417 in 2011; 417 in 2012; SpSF]
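For a line forced through the origin, the least-squares slope has a closed form: the sum of area times sales divided by the sum of squared areas. A minimal sketch of that calculation; the store figures are made up and the function name is hypothetical, so the result is not meant to reproduce the values on the slide:

```python
def no_intercept_slope(floor_area, sales):
    """Least-squares slope of sales = slope * floor_area (line through origin)."""
    sxy = sum(x * y for x, y in zip(floor_area, sales))
    sxx = sum(x * x for x in floor_area)
    if sxx == 0:
        raise ValueError("floor_area must contain a non-zero value")
    return sxy / sxx


# Made-up stores: sales floor in sq. ft. and annual sales in CAD.
floor_area = [10_000.0, 20_000.0, 40_000.0]
sales = [4_000_000.0, 8_200_000.0, 15_800_000.0]
spsf = no_intercept_slope(floor_area, sales)  # CAD per sq. ft.
```

Running the same fit on each year's stores and comparing the slopes gives the year-over-year change in SpSF.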
33. Looking for Patterns and Anomalies
This chart tells us that most of the stores have their highest sales on Saturday. But Store X peaks on Friday and is also doing well on Mondays. Why?
[Chart: sales by day of week, THU through WED; y-axis from 0 to 10,000,000]
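Spotting a store like Store X does not have to rely on reading the chart. A minimal sketch that finds each store's peak day and flags the stores that deviate from the most common peak; all figures and store names other than Store X are made up for illustration:

```python
from collections import Counter


def peak_day(daily_sales):
    """Day with the highest sales for one store."""
    return max(daily_sales, key=daily_sales.get)


def unusual_peaks(stores):
    """Stores whose peak day differs from the most common peak day."""
    peaks = {name: peak_day(days) for name, days in stores.items()}
    typical, _ = Counter(peaks.values()).most_common(1)[0]
    return typical, {n: d for n, d in peaks.items() if d != typical}


# Made-up figures (in millions) for illustration only.
stores = {
    "Store A": {"THU": 5, "FRI": 7, "SAT": 9, "SUN": 6, "MON": 3, "TUE": 3, "WED": 4},
    "Store B": {"THU": 4, "FRI": 6, "SAT": 8, "SUN": 7, "MON": 2, "TUE": 3, "WED": 3},
    "Store X": {"THU": 5, "FRI": 9, "SAT": 7, "SUN": 4, "MON": 8, "TUE": 3, "WED": 4},
}
typical, outliers = unusual_peaks(stores)
```

The flagged stores are where the "Why?" starts; the code finds the anomaly, but explaining it still takes domain knowledge.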
34. Affinity Analysis Use Case
Build a model that provides the foundation for analyzing and understanding the factors that influence year-over-year change in store performance
• Affinity Analysis is an input to:
• Identifying products purchased in tandem
• Providing guidance and recommendations for upsell and cross-sell
• Redesigning stores, layouts and planograms
• Discount plans and promotions
• Identifying customer baskets in different times and geographies
• Investigating patterns at fine line and product levels
• Ranking customer baskets by number of times bought together and by revenue contributed
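The last item, ranking by times bought together and revenue contributed, can be sketched as pair counting over baskets. This is a simplified illustration with made-up baskets and amounts; the function name and the choice to use revenue only as a tie-breaker are assumptions:

```python
from collections import Counter
from itertools import combinations


def rank_pairs(baskets):
    """Rank product pairs by times bought together, then by pair revenue."""
    times = Counter()
    revenue = Counter()
    for basket in baskets:
        # Sort names so ("a", "b") and ("b", "a") count as the same pair.
        for a, b in combinations(sorted(basket), 2):
            times[(a, b)] += 1
            revenue[(a, b)] += basket[a] + basket[b]
    ordered = sorted(times, key=lambda p: (times[p], revenue[p]), reverse=True)
    return [(pair, times[pair], revenue[pair]) for pair in ordered]


# Made-up baskets: product category -> amount spent in CAD.
baskets = [
    {"potted annuals": 12.0, "cell-packs": 8.0},
    {"potted annuals": 10.0, "cell-packs": 6.0, "outdoor soils": 9.0},
    {"potted annuals": 15.0, "outdoor soils": 7.0},
    {"vegetables": 5.0, "cell-packs": 4.0},
]
ranked = rank_pairs(baskets)
```

Pair counting grows quadratically with basket size, so on real transaction volumes this step is typically distributed across the cluster rather than run on one machine.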
37. Related Baskets
The size of each circle shows how often the basket has been purchased
Season: 2012-05-16 - 2012-08-28
This kind of analysis can be used for spotting driver products
1. Potted annuals/plants, Cell-packs/annual plants
2. Potted annuals/plants, vegetables/plants
3. Potted annuals/plants, Outdoor soils/outdoor lawn & plant care
4. Cell-packs/annual plants, vegetables/annual plants
38. Big Data is Evolving
• The industry is evolving
• Hadoop is now 8 years old, having started in 2006 at Yahoo
• CDH 5 recently released
• $2.5B in venture capital in the space
• Hadoop is now considered a standard
• HBase is an example of a project which has not become a standard
• Many tools exist today. What will there be 5 years from now?
• How to avoid the big data pitfalls?
• 50% of big data projects fail
• Those who succeed do so through focus
• Insight vs. Impact
• Find one problem and fix it
• Data Science
• Change how you do analysis… scientific methods
• New and exciting
• Build a hybrid team to develop Data solutions
• The team can program, knows math and statistics, and can communicate
40. Thank You and Questions
Ian Abramson
EPAM Systems
Toronto, Canada
GMT -5
Mobile phone: +1 (416) 254-9286
Skype: ian.abramson
E-mail: Ian_Abramson@epam.com