In this slide deck, Infochimps Director of Product Tim Gasper discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days, sometimes in just hours. Tim explains how Infochimps now takes that same aggressive approach to deliver faster time to value by helping customers develop analytic applications at remarkable speed.
7. What is a Big Data app?
? + Critical Business Problems = Impactful Analytic Applications
8.
• Smart Meter Monitoring for Customer Value Add
• Predictive Inventory Levels to Minimize Warehousing Costs
• Personalized Medicine Treatment Programs
• Trade Options and Futures Pricing Platform
• Customer Churn Analysis for Increased Customer Lifetime Value
Source: PARC
14.
• Predictive Manufacturing + Smart Manufacturing & Energy
• Ad Publisher Campaign Analytics
• 360 Customer Experience Management
• Social Media Monitoring & Analytics
15. The Traditional Way
Business Discovery → Info Discovery → Logical Data Model → Physical Data Model → System Staging → Data Ingestion, Transformation, ETL → Application Development → Production Staging → Analytics
Data Warehouse Project: 12-24 Months to Reach Production
16. Big Data: A New Hope
Data Warehouse Project (12-24 Months to Reach Production):
Business Discovery → Info Discovery → Logical Data Model → Physical Data Model → System Staging → Data Ingestion, Transformation, ETL → Application Development → Production Staging → Analytics
Big Data Project (3-6 Months to Reach Production):
Business Discovery → Info Discovery → System Staging → Initial Data Ingest → repeated cycles of (Schema on Read → App Dev → Analytics) → Production Staging
18. Speed to Value: A Case Study
HGST, a Western Digital company, is improving customer support and product quality by collecting, analyzing, and acting on massive quantities of machine and sensor data.
• Greatly diminished operational burden, with the ability to focus on analysis and driving business action
• Fast project delivery and success
• Expertise with Big Data technologies like Hadoop
KEY STATS
• Industry: Storage Technology
• Solution: Machine Data Analysis Engine
• Channel: B2B
• Cloud Services: Cloud::Queries, Cloud::Hadoop
• Users: Application Developers, Data Scientists, Analysts
• Deployment: Amazon Web Services
21. Enablers of Agile Big Data
1. Managed infrastructure means focusing on Big Data apps
2. The community tech itself and what it enables
3. Our customer engagement framework for choosing use cases that have impact and designing successful solutions
4. Agile, iterative analytics app dev lifecycle
5. Our application reference design framework for kick-starting application development
23. Technologies Under the Hood
PART 1
HADOOP
• Java MapReduce
• Streaming MapReduce
• SQL on Hadoop, Pig, Hive
NOSQL DATABASES
• HBase/Accumulo
• Elasticsearch
• Cassandra, MongoDB
STREAM PROCESSING, MESSAGE QUEUES
• Storm
• Kafka
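To make "Streaming MapReduce" concrete: Hadoop Streaming lets any executable that reads stdin and writes stdout act as a mapper or reducer. The sketch below simulates that contract locally with a word-count example; the function names and sample lines are illustrative, not part of any Infochimps tooling.

```python
# Hadoop Streaming-style word count, simulated locally. The same
# mapper/reducer logic could run under the hadoop-streaming jar, where
# the framework handles the sort/shuffle between the two phases.
import itertools
import operator

def mapper(line):
    """Emit (word, 1) pairs, as a streaming mapper would print tab-separated."""
    for word in line.strip().lower().split():
        yield word, 1

def reducer(pairs):
    """Sum counts per key; assumes pairs arrive sorted by key (the shuffle phase)."""
    for word, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data big apps", "big value"]
mapped = sorted(pair for line in lines for pair in mapper(line))
print(dict(reducer(mapped)))  # {'apps': 1, 'big': 3, 'data': 1, 'value': 1}
```

The `sorted()` call stands in for Hadoop's shuffle; in a real job the mapper and reducer would be separate scripts wired together by the streaming jar.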
24. Technologies Under the Hood
PART 2
HADOOP INTERFACES
• Hue
• Command Line
STATISTICAL TOOLS
• R, SAS, SPSS
BUSINESS INTELLIGENCE AND DATA VIZ
• Legacy: Cognos, Biz Objects, OBIEE, Microsoft BI
• New Gen: Tableau, Qlikview, SiSense, Kibana
25. Our Unique Toolset Addition
SaaS: Develop & test locally with app/analytics scripting & "Deploy Pack" orchestration
PaaS: Real-time analytics with Cloud::Streams; interactive analytics with Cloud::Queries; batch analytics with Cloud::Hadoop; abstract to any cloud with the Orchestration DSL
IaaS: Public Cloud, Virtual Private Cloud, Private Cloud
26. Customer Engagement Framework
• Discovery (Week 1-2): Service Requirements. Interview key business stakeholders, define business benefits, define target use case, define objectives & challenges, develop high-level approach & costs, identify data sources.
• Design & Build (Week 3-4): Technical Design. Interview key technical stakeholders, design data flows, define architecture, agree to project plan/rollout.
• Iterative App Development (Week 5-8+): Platform Rollout. Build data flows (real-time data flow, historical data), architecture validation, stand up / connect environment, tune solution.
• Production: Ongoing.
MAJOR ACTIVITIES
• Run 2-4 hour Design Thinking Workshop
• Review current state metrics
• Review business pain points & opportunities
• Review application & infrastructure environment
• Define target use case
• Identify data sources for target use case
• Develop high-level tech approach and costs
• Define high-level benefits
• Develop initial case for action
• Develop go-forward plan
• Develop data model
• Technical architecture & integration design
• Stand up environment
• Dashboard design workshops
• Data mapping
• Build prototype dashboard
• Configure prototype application
• Data load
• Run solution iterations
• Analytical modeling
28. App Reference Design Framework
• A use-case-driven reference design
• A code repository with:
  o Domain-specific sample data sets/sources
  o Sample data flows
  o Sample data processors/analytics
  o Simple data visualization
29. App Reference Designs
• Predictive Manufacturing + Smart Manufacturing & Energy
• Ad Publisher Campaign Analytics
• 360 Customer Experience Management
• Social Media Monitoring & Analytics
32. Big Data Benefits
More Data, Faster Data, New Use Cases, New Analytics and Analytical Techniques, Increased Flexibility, Faster Iteration, Time to Value
ENABLED BY
• Unstructured data and semi-structured data allow for a faster path to data integration
• Real-time analysis and batch analysis with scripting tools
• Schema on read for app-driven data models and data structures
• Local to cloud, small data to big data… tools can talk to each other
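"Schema on read" means raw records are stored untouched and each application imposes its own structure only at read time, instead of agreeing on a warehouse schema up front. A minimal sketch of the idea, using invented smart-meter field names for illustration:

```python
# Schema on read: keep raw, semi-structured records as-is; each app
# projects them onto the fields it cares about at read time, so new
# apps need no up-front ETL or shared data model.
import json

raw_events = [
    '{"ts": "2013-05-01T12:00:00Z", "meter_id": "m1", "kwh": 3.2}',
    '{"ts": "2013-05-01T13:00:00Z", "meter_id": "m1", "kwh": 2.7, "alarm": true}',
]

def read_with_schema(raw, schema):
    """Apply an app-specific schema while reading; absent fields become None."""
    for line in raw:
        record = json.loads(line)
        yield {field: record.get(field) for field in schema}

# Two apps, two schemas, one raw data set -- no re-ingestion needed.
billing_view = list(read_with_schema(raw_events, ["meter_id", "kwh"]))
alerting_view = list(read_with_schema(raw_events, ["ts", "alarm"]))
print(billing_view[0])  # {'meter_id': 'm1', 'kwh': 3.2}
```

The trade-off is that each reader must tolerate missing or messy fields, which is why schema on read pairs naturally with iterative app development.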
Quick stories about the transformative effect big data can have on a business... or the world. An app is a use case! Big data is not a toy. Exploration is great... but to what end? Focus leads to value faster. Diagram of all the different use cases and industries that big data affects. THIS WAS A LITTLE LONG, KEEP IT SHORT AND SWEET.
Where are you in terms of adoption of big data applications?
• Already have applications in production
• Apps currently under development
• Planning and evaluation phase
• Researching / early exploration
• I don't know / no current plans
Our use of the terms analytics and analysis is extremely broad: it covers both simple statistics and more advanced modeling. When I want to call out modeling, I usually use the term "modeling" specifically, or I use the phrase "advanced analytics" to differentiate it from simpler analytics. The phrase "analytic application" essentially means a data-oriented, use-case-driven application.
Have you identified your first big data application use case (or your next one)?
• Yes
• No
• I don't know
Re-emphasize iterative design here… it's an organizational change and a technology change. Use one of Jim's diagrams showing traditional data analysis application cycles, including the long time spent upfront doing data modeling and ETL transformation. Build the diagram from one step to the next via animations. This is problematic for three reasons: 1) time to value is slower; 2) it takes longer to reach the first checkpoint of success or failure of the project; 3) it is difficult to iterate.
Diagram of the four customers: how fast they developed apps and how few developers it took to create them. DON'T DWELL HERE, DON'T TALK TO EVERY SINGLE USE CASE… STAY HIGH LEVEL.
What does HGST do? Saying they are a part of Western Digital isn't enough.
Poll 3: What is your biggest challenge to realizing the value of Big Data applications?
• Talent gap / experience
• Cost of capital investment
• Big Data technology risk
• Failed prior projects
• Other / N/A
Java compiling, etc., versus scripting approaches… we really like using scripting tools.
Wukong and Ironfan are both open source, and we've contributed them back! Similar to the slide that shows how Wukong is the DSL for big data app dev and Ironfan is the DSL for big data infrastructure dev, except incorporating the broader picture of Tachyon the orchestrator and the Deploy Pack application code vessel. So now let's drill in and look at how we actually deliver a solution.

The Problem
There are two complementary ways to process Big Data: batch processing and real-time (or stream) processing. These are traditionally viewed as very different approaches to solving problems, especially in a Big Data context, where the toolsets for each kind of processing differ greatly. Typically, for cross-platform work there are several issues that slow down analytic development:
• You need to run the whole thing – the entire infrastructure has to be running in order to test small changes.
• You will wait 10 minutes every time you make a mistake – compiling jar files, transferring code, launching jobs, and finding log files is time consuming.
• You will disrupt production traffic – if you are doing any testing at scale.
• Hadoop does not understand Storm, and Storm does not understand Hadoop – same language, different paradigms, different base classes.

The Solution
Wukong is a Domain Specific Language (DSL) designed specifically for data analytics, processing, and flow. It abstracts the platform that the analytics run on (like Hadoop, Storm, or your local command line) and lets you focus on writing analytics. A simple Wukong script can be written in a few lines in a plain text file on your hard drive. It can then be run as a simple command line application, as a large Hadoop job, or as part of a real-time Storm topology.
The same analytics can be leveraged over and over again across your enterprise. Wukong enables its users to:
• Write and test code locally – from the command line
• Avoid disrupting others – your deploy pack
• Debug rapidly – see results in real time
• Seamlessly move between contexts – like real-time (with Storm) and batch (with Hadoop)
This allows for very rapid iteration of analytics, and lets your data scientists be as agile as your business demands.

Questions
• If you develop real-time analytics, how would you run those against historical data?
• Does every developer in your organization have their own Hadoop cluster?

References
http://www.infochimps.com/infochimps-cloud/tools/wukong/
https://github.com/infochimps-labs/wukong/tree/3.0.0/
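Wukong itself is a Ruby DSL; as a language-neutral sketch of the underlying idea (one pure analytic function reused unchanged across batch and stream runtimes), here is a hypothetical Python illustration. It is not the Wukong API; all names are invented.

```python
# The principle behind Wukong: keep the analytic a pure function, then
# bind it to different runtimes (command line, batch, stream) without
# rewriting it. The runtimes below are stand-ins, not real frameworks.
def extract_error(record):
    """The analytic: pass through only records that look like errors."""
    return record if "ERROR" in record else None

def run_batch(records, analytic):
    """Batch context: process a whole dataset at once (a stand-in for Hadoop)."""
    return [out for r in records if (out := analytic(r)) is not None]

def run_stream(feed, analytic):
    """Stream context: process records one at a time (a stand-in for Storm)."""
    for record in feed:
        out = analytic(record)
        if out is not None:
            yield out

logs = ["INFO boot", "ERROR disk full", "INFO ok", "ERROR timeout"]
print(run_batch(logs, extract_error))               # ['ERROR disk full', 'ERROR timeout']
print(list(run_stream(iter(logs), extract_error)))  # same result, record by record
```

Because `extract_error` knows nothing about its runtime, the same code answers the question raised above: the analytic written for the stream can be replayed against historical data in batch.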
Need to make the shape go all the way up through the project plan. Platform rollout has a lot more to it, including QA/testing, analytics development, production rollout of the first application, training, and acceptance/success testing. Production should include infrastructure support, application/analytics support, SLA, and managed services (training and acceptance testing could be part of the bottom part).
Probably the diagram that shows the loop from local to cloud... except updated and made more powerful... maybe have the animations build as well.
Call on the audience to figure out their first application and begin the path toward success by following the framework in this webinar. If big data projects are already underway: are you finding business value? Do you feel like you are iterating through use cases? Are your personnel utilizing their existing talents and strengths?
I invite you to let us know what your use case is, and we can help you evaluate which tools and architecture are appropriate to solve it. Now we are open to questions!