Big Data is the evolution of supercomputing for commercial enterprises and governments. Originally the domain of companies operating at Internet scale, Big Data today helps organizations of all sizes discover patterns in their data and gain insight into their business.
But understanding the differences between the plethora of new technologies can be daunting. Graph, columnar, key-value, and document stores are all called NoSQL, but which is best? And where does Hadoop fit in this ecosystem? Its low cost and high efficiency have made it very popular, but what role should it play?
In this webinar, we will explore:
The full spectrum of Big Data
Hadoop and MongoDB: friends or frenemies?
Differences between Systems of Record and Systems of Engagement
MongoDB customer examples of Systems of Engagement
4. 4
• Last 12 years (2002-Now) - Executive Consultant, on the board
and advisory board of several new software companies
including Big Data players such as MongoDB
• 10 Years (1992-2002) – Oracle, Group Vice President, Systems
Architecture and Technology, responsible for the server product
planning and rollout
• 16 years (1975-1992) – IBM, Planner, architect, and
development manager for DB2 product line at Silicon Valley
Lab and Austin Lab. Head of IBM's Database
architecture, strategy, and technology
Jnan Dash
5. 5
• Finally, some real innovation in DBMS
• MongoDB momentum is unprecedented!
• The changing landscape needs MongoDB
– “Internet scale” distributed operations + highly flexible
data model for agile development + open source
• Perfect fit for cloud, mobility, and big data
Why am I excited about MongoDB?
6. 6
• Big Data - Observations
• Evolution of Database Technology
• Hadoop+MongoDB
• Customer Examples
• Roadmap
• Summary
Agenda
7. 7
1. A thousand years ago – Experimental Science
Description of natural phenomena
2. Last few hundred years – Theoretical Science
Newton's Laws, Maxwell's Equations, ...
3. Last few decades – Computational Science
Simulation of complex phenomena
4. Today – Data-intensive Science
Scientists overwhelmed with data deluge
Unify theory, experiment & simulation
The Fourth Paradigm
8. 8
Internet Scale Commercial Supercomputing
• Originated with companies operating at Internet scale (to handle
ever-increasing numbers of users and volumes of data)
– Yahoo in the 1990s, then Google, Facebook, Twitter
– They needed to do it quickly and affordably at scale
• Hadoop is the first commercial supercomputing software platform
– Works at scale, affordable at scale
• HPC was used for scientific supercomputing in meteorology and
engineering. Big Data is the commercial equivalent of HPC
– Less about equations, more about discovery, patterns
• Many technologies have been around for decades
• Clustering
• Parallel processing
• Distributed file systems
11. 11
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
12. 12
Big Data – the full spectrum
Transaction Processing
Analytical Processing
Data Mining, Visualization, and Integration Tools
RDBMS
OLAP/DW
DW Appliance
Hadoop, Impala, ...
NoSQL
NewSQL, In-Memory, Stream...
Online/Realtime
Offline/Batch
15. 15
Data Management over the years
• 1960s: File Systems
• 1970s: 1st Generation DBMS; Data as Shared Resource
• 1980s: Relational Technology; Ease of Query
• 1990s: New data types; OLAP/DW; Web Support; Unstructured Data
• 2005+: Big Data; Post-PC, Data Deluge, 3Vs, NoSQL
17. 17
MongoDB Features
• JSON Document Model
with Dynamic Schemas
• Auto-Sharding for
Horizontal Scalability
• Text Search
• Aggregation Framework
and MapReduce
• Full, Flexible Index Support
and Rich Queries
• Native Replication for High
Availability
• Advanced Security
• Large Media Storage with
GridFS
18. 18
Documents are Rich Data Structures
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: '+447557505611',
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
Fields
Typed field values
Fields can contain arrays
Fields can contain an array of sub-documents
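As a rough illustration of the callouts on this slide, the same document can be written as a nested Python dictionary. This is only a sketch, not a MongoDB API call: the field names come from the slide, the elided car fields ("…") are simply left out, and the profession values are assumed to be strings.

```python
# A plain-Python stand-in for the slide's document (no database involved).
person = {
    "first_name": "Paul",
    "surname": "Miller",
    "cell": "+447557505611",
    "city": "London",
    "location": [45.123, 47.232],                     # array of numbers
    "Profession": ["banking", "finance", "trader"],   # array of strings (assumed)
    "cars": [                                         # array of sub-documents
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Typed field values: strings, arrays, and nested documents coexist in one record.
field_types = {name: type(value).__name__ for name, value in person.items()}

# Reaching into the array of sub-documents.
car_models = [car["model"] for car in person["cars"]]
```

The point of the shape is that a person and their cars travel together as one record, where a relational design would typically split them across two tables.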
20. 20
• Hundreds of thousands of records per second
• Fast response required
• Sometimes all data kept, sometimes just
summary
• Horizontal scalability required
Fast Moving Data
21. 21
• A machine generates a specific kind of data
• The data model is unlikely to change
• But there are so many different machines…
• Queryability across all types
Data is Structured, but Varied…
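A small sketch may help make "structured, but varied" concrete. Each machine type below has its own fixed set of fields, yet all readings sit in one collection and can be queried together. Every machine type, field name, and value here is invented for illustration, and the `find` helper is a hypothetical stand-in for a database query, not a MongoDB call.

```python
# Hypothetical readings from three kinds of machines (all names are illustrative).
readings = [
    {"machine_type": "turbine", "machine_id": "t-1", "rpm": 1480, "temp_c": 71.5},
    {"machine_type": "pump", "machine_id": "p-7", "flow_lpm": 320.0, "temp_c": 48.0},
    {"machine_type": "meter", "machine_id": "m-3", "kwh": 12.4},
    {"machine_type": "pump", "machine_id": "p-9", "flow_lpm": 295.5, "temp_c": 83.2},
]


def find(docs, predicate):
    """Return the documents for which predicate(doc) is true."""
    return [doc for doc in docs if predicate(doc)]


# One query across all machine types: anything reporting a temperature above 70 C.
# Documents without a temp_c field (the meter) are simply skipped.
hot = find(readings, lambda d: d.get("temp_c", float("-inf")) > 70)
hot_ids = [d["machine_id"] for d in hot]
```

Because no shared schema is imposed, adding a fourth machine type means adding documents with new fields, not altering a table.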
22. 22
• Event data written multiple times per second,
minute, or hour
• Tracking progression of metrics over time
Time Series Data
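One common way to track a metric over time is to roll raw events up into fixed time buckets and keep only a summary per bucket. The sketch below does this per minute in plain Python. The event values, the choice of a one-minute bucket, and the `summarize_per_minute` helper are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (timestamp, metric value).
events = [
    (datetime(2014, 1, 1, 12, 0, 5), 10.0),
    (datetime(2014, 1, 1, 12, 0, 40), 14.0),
    (datetime(2014, 1, 1, 12, 1, 10), 9.0),
    (datetime(2014, 1, 1, 12, 1, 50), 11.0),
    (datetime(2014, 1, 1, 12, 1, 59), 13.0),
]


def summarize_per_minute(events):
    """Group events by minute and keep count, min, max, and mean per bucket."""
    buckets = defaultdict(list)
    for ts, value in events:
        minute = ts.replace(second=0, microsecond=0)  # truncate to the minute
        buckets[minute].append(value)
    return {
        minute: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for minute, values in sorted(buckets.items())
    }


summary = summarize_per_minute(events)
```

Whether to keep the raw events alongside the summaries is a storage and query trade-off that depends on the application.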
23. 23
Do More With Your Data
MongoDB
Rich Queries
• Find Paul’s cars
• Find everybody in London with a car
built between 1970 and 1980
Geospatial
• Find all of the car owners within 5km of
Trafalgar Sq.
Text Search
• Find all the cars described as having
leather seats
Aggregation
• Calculate the average value of Paul’s
car collection
Map Reduce
• What is the ownership pattern of colors
by geography over time? (is purple
trending up in China?)
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [51.524, -0.087],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
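To show what these questions compute, here is a minimal sketch that answers four of them in plain Python over an in-memory list. It does not use MongoDB's query language or indexes. Paul's document comes from the slide; the second owner, the Trafalgar Square coordinates (approximate), the [latitude, longitude] reading of `location`, and the `haversine_km` helper are assumptions added for illustration.

```python
from math import asin, cos, radians, sin, sqrt

people = [
    {   # from the slide (elided car fields omitted)
        "first_name": "Paul", "surname": "Miller", "city": "London",
        "location": [51.524, -0.087],
        "cars": [
            {"model": "Bentley", "year": 1973, "value": 100000},
            {"model": "Rolls Royce", "year": 1965, "value": 330000},
        ],
    },
    {   # hypothetical second owner, invented for contrast
        "first_name": "Anna", "surname": "Smith", "city": "Manchester",
        "location": [53.4808, -2.2426],
        "cars": [{"model": "Mini", "year": 1975, "value": 15000}],
    },
]


def haversine_km(a, b):
    """Great-circle distance in km between two [lat, lon] points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


# Rich query: Paul's cars.
pauls_cars = [c["model"] for p in people if p["first_name"] == "Paul" for c in p["cars"]]

# Rich query: everybody in London with a car built between 1970 and 1980.
london_70s = [
    p["first_name"] for p in people
    if p["city"] == "London" and any(1970 <= c["year"] <= 1980 for c in p["cars"])
]

# Geospatial: car owners within 5 km of Trafalgar Square (approximate coordinates).
TRAFALGAR_SQ = [51.508, -0.128]
nearby = [
    p["first_name"] for p in people
    if p["cars"] and haversine_km(p["location"], TRAFALGAR_SQ) <= 5.0
]

# Aggregation: average value of Paul's car collection.
paul_values = [c["value"] for p in people if p["first_name"] == "Paul" for c in p["cars"]]
average_value = sum(paul_values) / len(paul_values)
```

In MongoDB the same questions would be expressed through its query language, geospatial indexes, and aggregation framework, so the database does the filtering rather than application loops like these.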
25. 25
Enterprise Big Data Stack
Applications: CRM, ERP, Collaboration, Mobile, BI
Data Management: Online Data (RDBMS), Offline Data (EDW, Hadoop)
Infrastructure: OS & Virtualization, Compute, Storage, Network
Management & Monitoring
Security & Auditing
26. 26
MongoDB & Hadoop
Operational (MongoDB)
• Online, Real-time
• High concurrency & HA
• Live analytics
Analytical (Hadoop)
• Multi-source analytics
• Interactive & Batch
• Data lake
MongoDB Connector for Hadoop
27. 27
Hadoop Is Good for…
• Risk Modeling
• Churn Analysis
• Recommendation Modeling
• Ad Targeting
• Transaction Analysis
• Trade Surveillance
• Network Failure Prediction
• Search Quality
• Data Lake
28. 28
MongoDB Is Good for…
• Single View
• Mobile Apps
• Fraud Detection
• Customer Data Management
• Content Management & Delivery
• Database-as-a-Service
• Product & Asset Catalogs
• Internet of Things
• Social & Collaboration
30. 30
Many more examples
• Big Data
• Product & Asset Catalogs
• Security & Fraud
• Internet of Things
• Database-as-a-Service
• Mobile Apps
• Customer Data Management
• Single View
• Social & Collaboration
• Content Management
• Intelligence Agencies
• Top Investment and Retail Banks
• Top US Retailer
• Top Global Shipping Company
• Top Industrial Equipment Manufacturer
• Top Media Company
• Top Investment and Retail Banks
32. 32
• Makes MongoDB a Hadoop-enabled file system
• Full use of MongoDB's indexes
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Full support for data processing
– Hive
– MapReduce
– Pig
– Streaming
– EMR
MongoDB+Hadoop Connector
33. 33
Customer Example – MetLife
Customer
Service
• Insurance policies
• Demographic data
• Customer web data
• Call center data
• Real-time churn detection
• Customer action analysis
• Churn prediction
algorithms
Churn Analysis
34. 34
Customer Example - eCommerce
Travel
• Flights, hotels and cars
• Real-time offers
• User profiles, reviews
• User metadata (previous
purchases, clicks, views)
• User segmentation
• Offer recommendation engine
• Ad serving engine
• Bundling engine
Algorithms
36. 36
• Big Data covers a wide spectrum
– Volume, Velocity, Variety
– Hence the equation Big Data = Hadoop is a myth
• Enterprises are more concerned about Variety
– MongoDB provides the best platform
• Hadoop and MongoDB are complementary
– MongoDB for operational workloads
– Hadoop for analytical workloads
Summary
Editor's Notes
MongoDB provides agility, scalability, and performance without sacrificing the functionality of relational databases, like full index support and rich queries.
Indexes: secondary, compound, text search, geospatial, and more.
We have all these fantastic machines… they give the same metrics they used to, but now they transmit the data. We have metrics about metrics, and we need a place to store the data. We need a place to understand what the data means.
This is where MongoDB fits into the existing enterprise IT stack.
MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage, and even analyze data in real time. (Compared to Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
• Makes MongoDB a Hadoop-enabled file system
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Uses MongoDB indexes to filter data
• Full support for data processing: Hive, MapReduce, Pig, Streaming
What each of these has in common is that they’re retrospective: they’re about looking at the past to help predict the future. The learnings from these Hadoop applications end up being applied by a different technology. This is where MongoDB comes in.
• Customer Data Management (e.g., Customer Relationship Management, Biometrics, User Profile Management)
• Product and Asset Catalogs (e.g., eCommerce, Inventory Management)
• Social and Collaboration Apps (e.g., Social Networks and Feeds, Document and Project Collaboration Tools)
• Mobile Apps (e.g., for Smartphones and Tablets)
• Content Management (e.g., Web CMS, Document Management, Digital Asset and Metadata Management)
• Internet of Things / Machine to Machine (e.g., mHealth, Connected Home, Smart Meters)
• Security and Fraud Apps (e.g., Fraud Detection, Cyberthreat Analysis)
• DbaaS (Cloud Database-as-a-Service)
• Data Hub (Aggregating Data from Multiple Sources for Operational or Analytical Purposes)
• Big Data (e.g., Genomics, Clickstream Analysis, Customer Sentiment Analysis)