"Big Data Use Cases" was presented to Lansing BigData and Hadoop Users Group Kickoff meeting on 2/24/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of CDH 5.3, HDP 2.2 and AWS cloud
1. Big Data Use Cases
InSemble Inc.
http://www.insemble.com
2. Agenda
• What is Big Data?
• Relevance to your Enterprise
• Hadoop Ecosystem & Business Use Cases
• Technical Use Cases and Demo
• Q and A with Cloudera
3. Big Data Definitions
• Wikipedia defines it as “data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time”
• Gartner defines it as data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another definition is “Big Data is large-volume, unstructured data that cannot be handled by traditional database management systems”
4. Why a game changer
• Schema on Read
– Interpreting data at processing time
– Keys and values are not intrinsic properties of the data but are chosen by the person analyzing it
• Move code to data
– With traditional systems, we bring data to the code and I/O becomes a bottleneck
– With distributed systems, we have to deal with our own checkpointing/recovery
• More data beats better algorithms
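The "Schema on Read" point can be illustrated with a minimal Python sketch. It is not part of the original presentation: the sample rows, the field names (`day`, `channel`, `customer`, `amount`) and both helper functions are hypothetical. The raw text is stored untouched, a schema is applied only when it is read, and two analysts pick different keys over the same data.

```python
import csv
import io

# Raw records are stored as-is; no schema is enforced when they are written.
# These rows are made-up sample data, not taken from the presentation.
RAW = """2015-01-03,web,alice,42.50
2015-01-03,store,bob,17.00
2015-01-04,web,alice,8.25
"""

# Hypothetical schema, supplied by the reader at processing time.
FIELDS = ["day", "channel", "customer", "amount"]
CASTS = [str, str, str, float]


def read_with_schema(raw, fields, casts):
    """Apply field names and type casts while reading the raw text."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for name, cast, value in zip(fields, casts, row)}


def total_by(records, key, value):
    """Group by an analyst-chosen key and sum an analyst-chosen value."""
    totals = {}
    for rec in records:
        totals[rec[key]] = totals.get(rec[key], 0.0) + rec[value]
    return totals


# Same raw data, two different choices of key.
by_channel = total_by(read_with_schema(RAW, FIELDS, CASTS), "channel", "amount")
by_customer = total_by(read_with_schema(RAW, FIELDS, CASTS), "customer", "amount")
print(by_channel)
print(by_customer)
```

Changing the analysis only means passing a different key or a different schema; the stored data is never rewritten.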
5. Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
7. Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
8. Big Data Journey
Figure axes: Journey Over Time, Business Value (Effects: GREAT, GOOD)
Stages of the journey:
• Aware of Benefits
– Scout for Opportunities
• Execute
– Pilot project
• Expand
– Multiple Use cases
• Managed
– Governance Model
• Optimized
– Core competency
Capabilities shown along the journey:
• Real time Insight from all channels
• IT is key differentiator for your business
• Perfect alignment of Business and IT
• Ad Hoc Data Exploration
• Batch, Interactive, Real time use cases
• Predictive Analytics, Machine Learning
• Consolidated Analytics
• ETL
• Time Constraints
• Security Standards Defined
• Governance Standards Defined
• Integrated with the Enterprise
• Evaluate Business Benefits
• Understand Ecosystem
• Identify Platform
9. Insurance Domain – Case Study
Source: Cloudera (Three-Customer-Case-Studies_Industry-Brief.pdf)
SOLUTION
• Cloudera Enterprise
• Apache Hive/Impala
• Apache Sqoop
• Coexist with Enterprise Warehouses & Mainframe
REQUIREMENTS
• Customized plans based on multiple data points
– Lifestyle, health patterns, habits, preferences
• Find correlations from digitizing massive amounts of data
– Traffic patterns, demographics, weather
• Run analytics on multiple states simultaneously
BENEFITS
• Run descriptive models across historical data from all states
• Customized products catered to individual behaviors and risks
• Differentiated Marketing Offers
10. Common Use Cases
• Detail Records, Time Constraints
• Consolidated View, 360 degree View
• Recommendation Engines, Insurance Underwriting
• Sentiment Analysis, Fraud Detection
• Personalized Marketing, Products
12. General Thoughts
• Technology in hyper growth phase
• Complex
• Tools/Productivity/Monitoring products evolving
• Pilot Project
• Incremental Journey
13. Technical Use Case: Managing Hadoop Cluster
• Ambari vs Cloudera Manager
• Both provision, manage and monitor a Hadoop cluster
• Ambari
– Open source
– Based on existing open source projects such as Puppet, Ganglia and Nagios
• Cloudera Manager
– Proprietary tool but more mature
– As a management tool, do we really need OSS?
– Rolling upgrades and management of multiple clusters
17. Other considerations
• Insert, update, and delete with full ACID support
– Available since Hive 0.14: https://issues.apache.org/jira/browse/HIVE-5317
• Support for nested data structures
• Fault tolerance
• Work with certain file formats (Avro, LZO compression)
• Integrate SQL on Hadoop with other big data use cases
18. Demo - Hadoop cluster in AWS
• 6 EC2 machines in total, type t2.medium
• RHEL 6.5, 3.75 GB memory, 10 GB hard drive
• 5-node Hadoop cluster
• Public data set downloaded from https://data.cityofchicago.org
19. Demo
• Chicago Crime data from 2009 to present
• 2 million plus records
• Dangerous communities in Chicago (Hive vs Hive on Tez vs Impala)
• Use Tableau to connect to the Hadoop cluster
– Crime counts by crime type
– Homicide count by year
– Dangerous communities
– Homicide map
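The aggregations behind the Tableau views can be sketched in a few lines of Python. This is only an illustration of the kind of grouping involved, not the queries run in the demo (which used Hive, Hive on Tez and Impala). The sample rows and the field names `year`, `primary_type` and `community_area` are assumptions, and ranking a "dangerous community" by raw incident count is likewise an assumption.

```python
from collections import Counter

# Hypothetical sample rows; the real data set has 2 million plus records.
# The field names below are assumptions, not confirmed by the presentation.
ROWS = [
    {"year": 2013, "primary_type": "THEFT", "community_area": 25},
    {"year": 2013, "primary_type": "HOMICIDE", "community_area": 25},
    {"year": 2014, "primary_type": "HOMICIDE", "community_area": 43},
    {"year": 2014, "primary_type": "THEFT", "community_area": 8},
    {"year": 2014, "primary_type": "HOMICIDE", "community_area": 25},
]

# Crime counts by crime type.
counts_by_type = Counter(row["primary_type"] for row in ROWS)

# Homicide count by year.
homicides_by_year = Counter(
    row["year"] for row in ROWS if row["primary_type"] == "HOMICIDE"
)

# Community with the most recorded incidents (assumed ranking criterion).
crimes_by_community = Counter(row["community_area"] for row in ROWS)
top_community, top_count = crimes_by_community.most_common(1)[0]

print(counts_by_type)
print(homicides_by_year)
print(top_community, top_count)
```

In a SQL-on-Hadoop engine each of these would typically be a GROUP BY over the full table, which is what makes the query a useful basis for comparing engines.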