Scaling up with Cisco Big Data: Data + Science = Data Science
1. Data + Science = Data Science
Presented by:
eRic Choo
2. Big Data Products-Solutions Stack
Infrastructure - Servers, Storage, Data Protection & Retention Solutions
Business Intelligence
Data Mining & Business Analytics
Big Data Virtualization & Systems Integration
7. What is Spark? An Execution Engine on Top of Hadoop
• MapReduce is powerful, but hard
• Spark aims to be both powerful and easy for processing
• How does it do it?
– A more generalized form of MapReduce
– Elements transformed in parallel
– In-memory caching
– Supports Python & Scala, along with Java
[Diagram: Input, Map, Reduce, Output]
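The map and reduce stages named on this slide can be sketched in plain Python. This is only an illustration of the pattern, not Spark or Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the sample lines are assumptions made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine all values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical input, standing in for lines of a file on the cluster.
lines = ["big data on Hadoop", "Spark on Hadoop", "big data science"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

A more general engine lets many such steps be chained and keeps the intermediate collections in memory between steps, instead of writing each stage's output to disk.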
8. Spark advantages for the end user
Faster Development & Data Pipelining
• Simple, easy-to-understand programming
abstraction with an interactive shell
• APIs for Java, Python and Scala
• Enables reuse of code across batch,
interactive and streaming applications
e.g. calling machine learning library
routines in Spark SQL
In-Memory Performance
• General-purpose execution graphs
• In-memory pipelining to achieve maximum
performance without persisting
intermediate results to disk
Popular use cases include ETL, Machine Learning and Real-time Analytics
10. Hadoop with Speed Advantages - Example
Logistic regression in Hadoop MapReduce and Hadoop with Spark
[Chart: Hadoop MR vs. Hadoop w/ Spark]
Up to 10x faster on disk, 100x faster in memory
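Logistic regression is a useful benchmark because training is iterative: every pass reads the same data again, so an engine that keeps that data in memory avoids re-reading it from disk on each iteration. A minimal pure-Python sketch of the training loop follows. The toy data, learning rate and iteration count are assumptions chosen for illustration, and this is not the benchmark code behind the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, labels, iterations=100, learning_rate=0.5):
    """Batch gradient descent; each iteration scans the full data set again."""
    weights = [0.0] * len(points[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(points, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (prediction - y) * xi
        weights = [w - learning_rate * g / len(points) for w, g in zip(weights, gradient)]
    return weights


def predict(weights, x):
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


# Hypothetical toy data: the first feature is a constant bias term of 1.0,
# the second is a single centered measurement.
points = [[1.0, -2.0], [1.0, -1.5], [1.0, -1.0], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
labels = [0, 0, 0, 1, 1, 1]

weights = train_logistic_regression(points, labels)
predictions = [predict(weights, x) for x in points]
print(weights, predictions)
```

The inner loop over points is the part that runs once per iteration, which is where repeated disk reads would add up.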
15. Spark on MapR Advantages
• High Performance: World-record performance on disk coupled with in-memory processing advantages
• Enterprise-grade Applications: Industry-leading enterprise-grade features for the Spark stack
• 24/7 Best-in-class Global Support: Strategic partnership with Databricks to ensure enterprise support for the entire stack
• Operational DataStore + Spark: MapR-DB + Spark on one Hadoop cluster allows for real-time as-it-happens analytics
18. Cisco: Security Intelligence Operations
Sensor data lands in MapR
Spark Streaming on MapR for
first check on known threats
Data next processed on GraphX
and Mahout
Additional SQL querying done
via Spark SQL and Impala
Complex
Data Pipelining
without MapReduce
19. Industry Leading Ad-Targeting Platform:
Real-time Decisions
High performance analytics
over MapR-DB
Load from MapR-DB table into
RDD to augment scoring
Results stored back in MapR-DB
for other applications
Real-time Analytics
over NoSQL
20. Addressing Health Care Regulations
Patient information in MapR-DB
combined with clinical records to
compute re-admittance
probability
Process uses Spark with
transactional data in MapR-DB
Deploy home health services to
prevent re-admittance
Real-time Analytics
over NoSQL
21. Streaming Use Cases
• Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g.,
sensors, control parameters, alarms, notifications, maintenance logs, and imaging results)
from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive
maintenance planning, and optimized operations.
• Fraud Management: Real-time analysis of business communication and accounting
transactions to detect unusual activities.
• Marketing & Sales: Analysis of customer engagement and conversion, powering real-time
recommendations while customers are still on the site or in the store.
• Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote troubleshooting before expensive field technicians are dispatched.
• Information Technology: Log processing to detect unusual events occurring in stream(s) of
data, so that IT can take remedial action before service quality degrades.
Real-time Analytics
over Streaming
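Detecting unusual events in a stream, as in the log-processing and machine-data cases above, can be sketched with a sliding window. This is one simple technique among many, written in plain Python rather than with Spark Streaming. The window size, the threshold and the sample readings are assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_unusual(readings, window_size=5, threshold=3.0):
    """Flag a reading that deviates from the mean of the previous
    window_size readings by more than threshold standard deviations."""
    window = deque(maxlen=window_size)
    alerts = []
    for index, value in enumerate(readings):
        if len(window) == window_size:
            mu = mean(window)
            sigma = pstdev(window)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                alerts.append((index, value))
        window.append(value)  # the oldest reading drops out automatically
    return alerts


# Hypothetical sensor readings arriving one at a time.
readings = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
alerts = detect_unusual(readings)
print(alerts)
```

Only the last few readings are kept in memory, so the check can run continuously as data arrives.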
27. Data Science
• What is Data Science?
– Extraction of knowledge from data
employing math, statistics and information
theory (probability models, machine learning,
etc.)
28. Data Analytics/Science Development Cycle
Source: Wikipedia
Challenges
• Data Science knowledge
required
• Multiple models for testing
• Multiple ways of tuning testing
data
• Multiple iterations of testing
• Stabilizing results
Benefits of Automation
• Data Science knowledge built into
platform
• Automated testing of multiple
models
• Selection of most accurate models
• Reduced iterative testing time
• Effective use of Data Science
Resources
• Higher productivity and lower
cost
30. Supervised Learning
• Labelling of data according to a labelled training set
• Example
– I know that it will rain when
• Sky is dark
• More moisture in the air
• It is near the rainy season
– Question:
• In the current weather, will it rain?
• Type of algorithms
– Naive Bayes
– Linear Regression
– Decision Trees
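The rain example can be worked through with Naive Bayes, the first algorithm listed. The sketch below is a minimal pure-Python version with Laplace smoothing. The eight labelled observations are invented for illustration, and the three features (dark sky, moist air, rainy season) are taken from the example above.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each feature value occurs under each label."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def predict(model, row):
    """Return the label with the highest log-probability for this row."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace smoothing; every feature here is binary (2 possible values).
            score += math.log((feature_counts[(label, i)][value] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training set: (sky_dark, air_moist, rainy_season).
rows = [
    (True, True, True), (True, True, False), (True, False, True), (False, True, True),
    (False, False, False), (False, False, True), (True, False, False), (False, True, False),
]
labels = ["rain", "rain", "rain", "rain", "dry", "dry", "dry", "dry"]

model = train_naive_bayes(rows, labels)
print(predict(model, (True, True, True)))
```

The labelled rows play the role of "what I know", and the call to predict answers the question for the current weather.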
31. Unsupervised Learning
• Example:
– I have a set of data collected regarding weather
– I have several other sets of data that are not
related to the weather, e.g. forest fire data from a
nearby region
– Is there any relation between the data sets?
• Type of algorithms
– K-means
– Fuzzy Clustering
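K-means groups unlabelled points by repeatedly assigning each point to its nearest centroid and then moving each centroid to the mean of its points. A minimal pure-Python sketch follows. The two-dimensional sample points and the choice of starting centroids are assumptions for illustration, and real implementations add better initialization and a convergence check.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and recomputing each centroid as the mean of its cluster."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical measurements with no labels attached.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = [points[0], points[3]]

centroids, clusters = kmeans(points, initial_centroids)
print(centroids)
```

No labels are supplied anywhere: the grouping comes only from how close the points are to each other.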
39. Text Analytics
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce
40. Categorizing into Topics/Stories
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
CONSTRUCT STORIES
TOP TERMS CL 1 (CATEGORY: INNOVATION): Technology, 3D Printing, Steve Jobs, Sports Wear, …
TOP TERMS CL 2 (CATEGORY: SECURITY): United Nations, Dogs, Camera, Internet of Things, …
44. Sentiment Analysis
TWITTER (DATA IN JSON FORMAT)
Field: Value
For country: United States
By individual: State
Analyze: Tweets
Objective: To find out the level of happiness of each state in the USA
46. Sentiment Score Computation
San Francisco
Los Angeles
New York
Chicago
Boston
San Diego
Score at tweet level for CA (one per tweet)
Summing up the tweet-level scores for each state
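The computation described here, a score per tweet followed by a sum per state, can be sketched in plain Python. The word scores and the JSON field names ("state", "text") are assumptions for illustration. Real Twitter data carries location differently, and the deck does not say which scoring lexicon was used.

```python
import json
from collections import defaultdict

# Hypothetical word-level sentiment scores.
WORD_SCORES = {"happy": 2, "love": 3, "great": 2, "sad": -2, "hate": -3, "awful": -2}


def tweet_score(text):
    """Score one tweet by adding up the scores of the words it contains."""
    return sum(WORD_SCORES.get(word.strip(".,!?").lower(), 0) for word in text.split())


def state_scores(json_lines):
    """Sum the tweet-level scores for each state."""
    totals = defaultdict(int)
    for line in json_lines:
        tweet = json.loads(line)
        totals[tweet["state"]] += tweet_score(tweet["text"])
    return dict(totals)


# Hypothetical tweets, one JSON document per line.
raw_tweets = [
    '{"state": "CA", "text": "I love this great weather!"}',
    '{"state": "CA", "text": "Traffic in Los Angeles is awful."}',
    '{"state": "NY", "text": "So happy to be in New York"}',
    '{"state": "IL", "text": "I hate the cold in Chicago, so sad"}',
]

scores = state_scores(raw_tweets)
print(scores)
```

A raw sum favours states with more tweets, so dividing by the tweet count per state may give a fairer comparison.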
53. Data Science Automation
DataRobot is a platform that lets data scientists automate the entire model life
cycle, a process that is otherwise highly serialized and time-consuming. This life cycle
includes:
1. Pre-processing and feature engineering
2. Algorithm identification to build predictive model(s)
3. Training, testing, and validating of models
4. Building of deployment scripts for model deployment to provide business
insight
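Steps 2 and 3 of this life cycle, trying several candidate models and keeping the most accurate one, can be illustrated with a small loop. This is a generic sketch in plain Python and does not represent DataRobot's implementation or API. The two candidate models, the function names and the data are assumptions made up for the example.

```python
def fit_majority(train_x, train_y):
    """Baseline model: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


def fit_threshold(train_x, train_y):
    """Pick the cut-off value that classifies the training data best."""
    def correct_on_training(t):
        return sum(int(x > t) == y for x, y in zip(train_x, train_y))
    best = max(sorted(set(train_x)), key=correct_on_training)
    return lambda x: int(x > best)


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def select_best_model(builders, train, validation):
    """Train every candidate, score it on held-out data, return the winner."""
    results = {name: accuracy(fit(*train), *validation) for name, fit in builders.items()}
    best_name = max(results, key=results.get)
    return best_name, results


# Hypothetical training and validation sets: one numeric feature, binary label.
train = ([1, 2, 3, 6, 7, 8, 9], [0, 0, 0, 1, 1, 1, 1])
validation = ([2.5, 5, 6.5, 9], [0, 1, 1, 1])
builders = {"majority": fit_majority, "threshold": fit_threshold}

best_name, results = select_best_model(builders, train, validation)
print(best_name, results)
```

Adding a model to the comparison only requires one more entry in builders, which is the kind of repetitive work that automation is meant to remove.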
56. Quantium captures new niche in data analytics market
MapR Distribution for Apache Hadoop and
Cisco UCS cut query time by 92 percent,
improve accuracy of results
“With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our
competitors for the foreseeable future.”
- Alex Shaw, Head of Technology Operations, Quantium
https://marketplace.cisco.com/catalog/products/3344
57. Hosted on Cisco infrastructure, MapR
Distribution for Hadoop meets Quantium’s strict
requirements
To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements
and evaluated the available software and hardware solutions on the market.
“Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time
and money in the selection process,”
- Alex Shaw, Quantium’s Head of Technology Operations
58. “The POC demonstrated that MapR performs better than the competition. The
MapR file system gives us maximum control over how we store information within
the data volumes and has good security features.”
- Alex Shaw, Quantium’s Head of Technology Operations
• Quantium realized that a big data solution was needed, not only because
of the data volume but also the heavy analytical requirements.
• While the team chose Hadoop as the big
data software solution, they still needed
to choose the best distribution from
among the top-tier Hadoop vendors (see
figure 1).
• The first stage of the process, a thorough
analysis of features and benefits,
narrowed the field to MapR and one
other competitor.
59. • Performance of new platform exceeds targets
• Unique business model outpaces competitors
• Greater innovation, shorter time to market
“Having access to external data sets to combine with
our clients’ data distances us from everybody else in
this space,”
“We have a lot of smart people who have been
hamstrung by technology and its ability to implement
their ideas. Now they have improved ways of executing
analytics which opens up the ability to create new and
innovative solutions for our clients”
- Alex Shaw, Quantium’s Head of Technology Operations
60. • Scaling to accommodate business growth
• Multi-tenancy model safeguards client information
“MapR incorporates data partitioning
via the Volumes feature, which allows us
to logically segregate individual data
sets while optimizing data storage for
optimum performance,”
- Alex Shaw, Quantium’s Head of Technology
Operations
61. Extending the Quantium approach to new
markets
“We’ve expanded the range of problems that we can
solve, enabling our clients to grow their business by
interacting with each of their customers as individuals
with specific wants and needs,”
“With the Cisco-MapR platform, Quantium has
positioned itself to stay well ahead of our competitors
for the foreseeable future.”
- Alex Shaw, Quantium’s Head of Technology Operations
63. World's Largest Biometric Identity System: Aadhaar Experience
• 1.2 billion residents
– 640,000 villages, ~60% under $2/day, ~75% literacy
– <3% pay income tax, <20% banking
– ~1 billion mobile connections
– ~300-400m migrant workers
• $50 billion direct subsidies every year!
– Residents have no standard verifiable identity
– Most programs plagued with ghost and multiple
identities causing leakage of 20-40%
64. World's Largest Biometric Identity System: Aadhaar Experience
Demographic Data
• Compulsory data:
– Name, Age/Date of Birth,
Gender and
– Address of the resident
• Optional data:
– Mobile number
– Email address
Biometric Data
• Photograph
• All 10 fingerprints
• Both irises
12-digit Aadhaar Number
Unique, lifetime, biometric based identity
69. Big Data Implementation Road Map
PLAN BUILD MANAGE
Understand, Explore, Model, Assess
Discovery Workshop
Proof of Concept Validation
Plan, Design, Implement
Support / Managed Services
70. Please take some time to fill in the
feedback form and the question sheet.