Learn how to leverage MongoDB and Big Data technologies to derive rich business insight and build high performance business intelligence platforms. This presentation includes:
- Uncovering Opportunities with Big Data analytics
- Challenges of real-time data processing
- Best practices for performance optimization
- Real world case study
This presentation was given in partnership with CIGNEX Datamatics.
MongoDB Webinar Case Study Big Data Analytics
1. CIGNEX Datamatics Confidential www.cignex.com
Webinar:
Faster Big Data Analytics with MongoDB
Case Study: Building a Large-Scale Data Processing and Data Analysis Platform using MongoDB
Date: 6 April 2016
Speakers:
Buzz Moschetti
Enterprise Architecture and Special Programs
MongoDB
Anurag Seth
VP, Big Data Analytics & IoT Practice
CIGNEX Datamatics
2. Who are we?
Buzz Moschetti,
Enterprise Architecture and Special Programs, MongoDB
Buzz works with F1000 companies to help them design next-generation solutions and develop strategies for overall technology transformation. He is also the CTO of the partner program at MongoDB and a liaison to the Engineering, Product Management, and Marketing groups.
– 25+ years of experience in the field, mostly in financial services, as CAO of the Investment Bank at JPMorganChase, and at Bear Stearns before that
Anurag Seth,
VP, Big Data Analytics & Internet of Things (IoT) Practice, CIGNEX Datamatics
Anurag has a unique blend of technology expertise, from deep-tech VLSI chip design, to complex high-performance algorithmic software development in EDA (Electronic Design Automation), to embedded system design, to predictive modelling and Big Data Analytics deployment for compelling use cases (including IoT).
– 25 years of experience in technology development and delivery, in products as well as services, across VLSI/EDA, Healthcare, Enterprise Big Data Implementations and IoT
– Has served on the board of the VLSI Lab at IIT Kharagpur, was general chair of the International Conference on VLSI Design & Embedded Systems (2009), and continues to serve on the conference's steering committee
3. Topics
• Big Data Analytics: Opportunity & Challenges
• Case Study: Building a Large-Scale Data Processing and Analysis Platform using MongoDB
– Business Needs
– Our Approach
– Solution Architecture
– MongoDB - A Great Fit for Data Processing and Analytics
– MongoDB Performance Tuning - Our Holistic Approach
– Recommended Best Practices
• Why MongoDB?
• Why CIGNEX Datamatics?
4. Big Data Analytics: Business Opportunities
Over 88% of data sources and types are not being analyzed.
[Figure: data sources (Transactional & Application Data, Machine Data, Enterprise Content, Social Data, Sensor Data), each labelled by Volume, Velocity or Variety and as Structured, Semi-structured or Unstructured. Benefits shown: Reduce Operational Costs, Improved Risk Management, and many more.]
5. Big Data Analytics: Business Opportunities
Organizations that use Big Data Analytics to integrate, process and analyze these data sources are up to 25x more likely to outperform their competitors.
• Improve process efficiency (Sales, Marketing, Finance, Operations)
• Product/service innovation
• Monetize information
• Improved collaboration
• Improve customer experience
• Reduce operational costs
• Improved risk management
6. Big Data Analytics - Implementation to Production Challenges
Technical
• Getting the right data and infrastructure architecture for performance and scalability
• Leveraging investments in existing technologies
• Integrating multi-channel and varied data sources at modern volumes
• Data quality and accuracy challenges
• Big data technologies are evolving too quickly to adapt to
• Scarcity of skills and capabilities
Business
• Hard ROI from Big Data?
– Identify and monetize existing and new data streams
• Turn-around time for big data (predictive modelling) deployments
• Difficult to make big data fit for purpose (uncertainty), assess the level of trust, and ensure security and privacy
• Lack of domain centricity
7. Case Study: Building a Large-Scale Data Processing and Analysis Platform using MongoDB
8. Business Need
• A SaaS-based sales analytics platform that acquires, processes and enriches accessible public data to deliver data-driven customer and business insights that:
– Enhance the efficacy of customer acquisition
– Improve operational efficiency
– Reveal competitive and complementary selling opportunities
– Determine buying propensity, influencers and decision makers
[Figure: Public Data Acquisition, Social Listening, Customer/Business Insights]
9. Our Approach
1. DATA ACQUISITION
Define data sources that could influence the outcome.
2. DATA PREPARATION
Segment data by influential characteristics as the best variables to use, centred on the use case.
3. MODELING
Evaluate and combine multiple models or techniques that lead to higher efficiency.
4. ANALYTICS
Dashboard for Big Data Analytics.
Details:
• Social listening that integrates 20+ open public data sources using REST APIs.
• Store and manage the 1 billion+ objects expected to be ingested and processed, leveraging the elastic scalability of AWS cloud compute.
• Extensive multi-step, rule-based ETL process involving de-duplication, geo-coding, smart filtering over a huge dataset, etc.
• Semantic associations? Leverage the power of semantic associations (NLP for entity extraction and entity associations) to process millions of entities and implement complex business rules for data enrichment and refinement.
• Machine learning? Augment with ML algorithms in the longer run.
• Front-end application with intuitive search/mining and a dashboard with graphical visualization of thousands of records with faster response time.
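The de-duplication step of the rule-based ETL can be pictured with a small example. This is only a minimal sketch, not the platform's actual rules (the slides do not describe them): it assumes records are dictionaries with hypothetical `name` and `city` fields and treats two records as duplicates when those fields match after normalization.

```python
import re


def normalize(value):
    """Lower-case the text and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def deduplicate(records, key_fields=("name", "city")):
    """Keep the first record seen for each normalized key; drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(normalize(str(record.get(field, ""))) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two rows differ only in case and punctuation.
records = [
    {"name": "Acme Corp.", "city": "New York"},
    {"name": "ACME  corp", "city": "new york"},
    {"name": "Globex", "city": "Springfield"},
]
unique_records = deduplicate(records)
```

A real pipeline would likely add fuzzier matching and geo-coded keys; exact matching on normalized text is simply the easiest rule to reason about.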
10. Solution Architecture (High Level)
[Architecture diagram, hosted on Amazon EC2 (Elastic Compute Cloud). Components shown:]
• Data sources: Social Data, Market Data, External Data, Location Data, Customer Data
• Data Processing: Data Enrichment and Data Processing Cluster, with customized core-Java-based ETLs and Java scripts, and a Third Party ETL Cluster (one of these)
• MongoDB Cluster (drawn twice, each with one MongoDB Primary and two MongoDB Secondaries)
• Full Text Search Engine (one of these)
• Data Visualization: Front-End Application framework, with Jasper / Tableau / C3 / D3.js for visualization
11. MongoDB - A Great Fit for Data Processing and Analytics
Requirement:
• Support multiple data processing pipelines
– Via ETL tool
– Via custom code
– Via custom scripts
MongoDB features:
• Integration with leading data integration tools: Alteryx, Talend, Pentaho
• Java driver to create custom business logic
• Support for server-side JavaScript to trigger custom business logic
Requirement:
• Sustain write throughput with increasing data volumes
MongoDB features:
• Sharding to scale out horizontally and distribute load
• WiredTiger storage engine (MongoDB 3.0 and later), with features such as document-level concurrency facilitating excellent write performance, optimal memory usage, and data compression for faster data access and efficient storage
Requirement:
• Provide low latency
• Support a large number of concurrent users and sustain response times
MongoDB features:
• Sharding to route/distribute read requests to separate nodes
• Data and index compression features in the WiredTiger storage engine facilitate better performance
• Store indexes on separate mounts to improve read throughput
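To see why sharding spreads write load, it helps to look at how a hashed key scatters documents across nodes. The sketch below is a simplified stand-in and not MongoDB's implementation: MongoDB assigns hashed key ranges to chunks that a balancer moves between shards, whereas this example simply takes an MD5-based hash modulo the shard count. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a shard-key value to a shard index using a stable hash (simplified model)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(keys, shard_count):
    """Count how many keys land on each shard."""
    counts = Counter(shard_for(key, shard_count) for key in keys)
    return [counts.get(index, 0) for index in range(shard_count)]


# 30,000 sequential ids spread over 3 shards: each shard receives roughly a third.
per_shard = distribution(range(30000), 3)
```

Sequential ids would all hit one shard under a range-based key; hashing them first is what evens out the inserts.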
12. Implementation Challenges
Challenge:
• Unifying different data processing components (ETL, custom code) and overall ETL efficiency
Solution:
• Created a custom, configurable orchestration engine that allows full or partial execution of data processing steps
• Created a dashboard that monitors the execution steps and allows a restart from anywhere in the multi-step ETL process
Challenge:
• Performance tuning of data processing and analysis frameworks
Solution:
• Holistic approach to performance tuning (covered in detail later)
Challenge:
• Serve different data analysis use cases (full-text search, sub-second response times, persistent data storage)
Solution:
• Utilize complementary technologies
– MongoDB for persistent storage, horizontal scalability and analytics
– Elasticsearch or Solr for full-text search use cases
Challenge:
• Data quality
Solution:
• We initially underestimated the extent of quality issues with the data (more so since most of the data was public). During execution, we budgeted for and hired a dedicated, experienced BA who took responsibility for data quality and clean-up
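The orchestration idea (full or partial execution, restart from any step) can be sketched in a few lines. This is an illustrative sketch only; the actual engine is not shown in the slides, and `run_pipeline`, its parameters and the three step names are hypothetical.

```python
def run_pipeline(steps, data, start_from=None, stop_after=None, log=None):
    """Run named steps in order, optionally from `start_from` through `stop_after`.

    `steps` is a list of (name, function) pairs; each function takes the data
    produced by the previous step and returns the data for the next one.
    """
    names = [name for name, _ in steps]
    first = names.index(start_from) if start_from else 0
    last = names.index(stop_after) if stop_after else len(steps) - 1
    for name, func in steps[first:last + 1]:
        data = func(data)
        if log is not None:
            log.append(name)  # a dashboard could read this execution trail
    return data


# Hypothetical three-step ETL over a list of strings.
steps = [
    ("deduplicate", lambda rows: sorted(set(rows))),
    ("filter", lambda rows: [row for row in rows if row]),
    ("enrich", lambda rows: [row.upper() for row in rows]),
]

full_log = []
full_result = run_pipeline(steps, ["b", "a", "", "a"], log=full_log)

# Restart from the last step only, e.g. after fixing an enrichment rule.
resume_log = []
resumed_result = run_pipeline(steps, ["a", "b"], start_from="enrich", log=resume_log)
```

A restart only works if each step's output is persisted somewhere the next run can read it; here the caller passes that intermediate data in directly.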
13. Best Practices
To be successful, you must address your overall design and technology stack, not just schema design.
14. A Holistic Approach to MongoDB Performance Tuning
[Figure: tuning layers: Infrastructure Layer, Storage Engine, Data Model, Query Language, Application Layer. Concerns annotated on the figure:]
• Cluster sizing & configuration: right size, optimum price benefit
• Replica set sizing, sharding
• Map to use case, R/W heaviness
• Access-pattern-based schema
• Indexes, query tuning
• MongoDB drivers; architecture & design
15. A Holistic Approach to MongoDB Performance Tuning
• Infrastructure Sizing:
– SSDs provide a very significant performance boost, especially for write-heavy workloads
– Investing in CPUs with more cores often delivers more benefit than investing in faster CPUs
– Ensure that your working set fits in RAM (in the MongoDB 2.4 and 2.6 releases, the db.serverStatus() command can report an estimate of the current working set size)
– Evaluate thoroughly whether journaling is needed. Remember that with journaling turned on, the MMAPv1 engine maps the data files twice, which roughly doubles virtual memory use.
• Cloud Infrastructure Capacity Planning:
– Choose the right cloud instance type by evaluating access patterns, workloads and storage requirements.
[Figure: tuning stages: Infrastructure Sizing & Capacity Planning; OS & Storage Optimisation; Design Approach, Schema Design; Query Tuning; Future Scalability]
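The working-set rule can be turned into a quick back-of-the-envelope check. The sketch below rests on assumptions that are not from the slides: the working set is approximated as total index size plus the frequently accessed ("hot") portion of the data, and 20% of RAM is reserved as headroom for the OS and connections. Inputs such as index size could come from `db.stats()`, which reports `indexSize` and `dataSize`.

```python
GIB = 1024 ** 3


def fits_in_ram(index_bytes, hot_data_bytes, ram_bytes, headroom=0.2):
    """Return True when indexes plus hot data fit in RAM after reserving headroom.

    The 20% default headroom is an illustrative assumption, not a MongoDB rule.
    """
    working_set = index_bytes + hot_data_bytes
    usable = ram_bytes * (1 - headroom)
    return working_set <= usable


# 4 GiB of indexes + 8 GiB of hot data on a 16 GiB instance: fits (12 <= 12.8).
small_case = fits_in_ram(4 * GIB, 8 * GIB, 16 * GIB)
# 6 GiB of indexes + 8 GiB of hot data on the same instance: does not fit.
large_case = fits_in_ram(6 * GIB, 8 * GIB, 16 * GIB)
```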
16. A Holistic Approach to MongoDB Performance Tuning
• Storage Optimization:
– We recommend WiredTiger as the storage engine
• OS Optimization:
– Disable NUMA (non-uniform memory access), which is not good for an operational database; configure a memory interleave policy instead
– Don’t use Transparent Huge Pages; mongod performs better with normal virtual memory pages
– Set the readahead size to 32 (use blockdev --setra <value>)
– Increase ulimit (>20,000)
– Turn off atime for the storage volume containing the database files
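These OS settings are easy to drift from, so a small audit helper can be useful. The following is a hedged sketch: it assumes the "Huge Pages" advice refers to Linux Transparent Huge Pages, reads the readahead value as an upper bound, and takes already-collected values as inputs (for example the contents of `/sys/kernel/mm/transparent_hugepage/enabled`, the output of `blockdev --getra`, and `ulimit -n`). The function names are hypothetical.

```python
def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from text such as 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def check_os_settings(thp_text, readahead_sectors, open_file_limit):
    """List deviations from the recommendations on this slide."""
    issues = []
    if active_thp_mode(thp_text) != "never":
        issues.append("transparent huge pages are enabled")
    if readahead_sectors > 32:
        issues.append("readahead is above 32 sectors")
    if open_file_limit <= 20000:
        issues.append("open-file ulimit is not above 20,000")
    return issues


tuned_host = check_os_settings("always madvise [never]", 32, 64000)
default_host = check_os_settings("[always] madvise never", 256, 1024)
```

Keeping the parsing separate from the collection of values means the checks can be run against values gathered from many hosts.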
17. A Holistic Approach to MongoDB Performance Tuning
• Schema Design:
– Always invest time in schema design; a dynamic schema only means additional flexibility, not that design can be skipped
– Don’t store empty fields in documents
– Create indexes very carefully. More indexes != more performance. Indexes that do not fit in RAM are often counterproductive for performance
– Don’t create indexes on the fly
– Create indexes in a designated “maintenance window”
– Use the Bulk API whenever possible. We have often witnessed significant gains in write throughput
– Use the index optimizations available in the WiredTiger storage engine
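Two of the schema tips, dropping empty fields and writing in bulk, can be prepared on the client before any driver call. This is a minimal sketch under assumptions: "empty" is taken to mean `None`, an empty string, an empty list or an empty sub-document, and the helper names are hypothetical. Each batch could then be handed to a driver bulk call such as PyMongo's `insert_many`, which is not shown here because it needs a running server.

```python
def strip_empty_fields(doc):
    """Return a copy of doc without empty values, recursing into sub-documents.

    Values such as 0 and False are kept; only None, "", [] and {} are dropped.
    """
    cleaned = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = strip_empty_fields(value)
        if value is None or value == "" or value == [] or value == {}:
            continue
        cleaned[key] = value
    return cleaned


def batches(docs, size):
    """Yield successive lists of at most `size` documents, one per bulk write."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


raw = {"name": "Acme", "phone": "", "tags": [], "geo": {"lat": None}, "employees": 0}
compact = strip_empty_fields(raw)
chunks = list(batches(list(range(5)), 2))
```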
18. A Holistic Approach to MongoDB Performance Tuning
• Scalability:
– Horizontal scaling through sharding
– Use the MongoDB aggregation framework
– Always keep the NFRs (non-functional requirements) in focus from design to implementation.
• Query Tuning:
– Use indexes effectively to support queries
– Avoid negation in queries, and avoid scatter-gather queries
– Reduce the query result set size wherever possible using limit and projections
– Make effective and frequent use of the MongoDB query profiler and the explain command
– Leverage each utility provided by MongoDB: mongoperf, mongosniff, mongostat, mongotop
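Two of the query tips lend themselves to a simple review aid: spotting negation operators in a filter, and always pairing a filter with a projection and a limit. The sketch below is illustrative; the helper names are hypothetical, and the list of negation operators (`$ne`, `$nin`, `$not`, `$nor`) is the set assumed here to be what the slide means by "negation". The resulting dictionary uses the `filter`, `projection` and `limit` keyword names accepted by PyMongo's `find`.

```python
NEGATION_OPERATORS = {"$ne", "$nin", "$not", "$nor"}


def negation_operators_used(query):
    """Recursively collect negation operators appearing anywhere in a query document."""
    found = set()
    if isinstance(query, dict):
        for key, value in query.items():
            if key in NEGATION_OPERATORS:
                found.add(key)
            found |= negation_operators_used(value)
    elif isinstance(query, list):
        for item in query:
            found |= negation_operators_used(item)
    return found


def find_arguments(query, fields, limit):
    """Bundle a filter with an inclusion projection and a result limit."""
    return {"filter": query, "projection": {field: 1 for field in fields}, "limit": limit}


risky = {
    "status": {"$ne": "closed"},
    "$or": [{"city": {"$nin": ["X"]}}, {"size": {"$gt": 5}}],
}
flagged = negation_operators_used(risky)
args = find_arguments({"status": "open"}, ["name", "city"], 50)
```

Negation operators tend to match most of an index's range, which is why they are worth flagging during review rather than banning outright.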
19. Benefits Delivered
• Simplified solution architecture with the right technologies for the use case
• Performance tuning and scalability initiated from day 1
– A holistic approach to performance tuning reduced response times from ~2-3 minutes to ~3-5 seconds
• Proprietary and open source can coexist
– Leverage existing investments in proprietary tools alongside open source technologies that reduce licensing costs
– Leverage open source JavaScript components for visualization
• Team composition was critical; complementary skills were needed:
– Solution Architecture | Dev-Ops | Business Analysis/Data Science
• Elastic compute and storage
– Leverage the elastic scalability of the AWS cloud to upsize or downsize compute power based on data processing workloads.
20. MongoDB Vital Stats
500+ employees
2000+ customers
Over $311 million in funding
Offices in NY & Palo Alto and across EMEA and APAC
21. MongoDB Enterprise Edition
The best way to run MongoDB
Automated. Supported. Secured.
Features beyond those in the community edition:
• Enterprise-Grade Support
• Commercial License
• Ops Manager or Cloud Manager Premium
• Encrypted & In-Memory Storage Engines
• MongoDB Compass
• BI Connector (SQL Bridge)
• Advanced Security
• Platform Certification
• On-Demand Training
24. 7x-10x Performance, 50%-80% Less Storage
MongoDB 3.0 Set The Stage…
How: WiredTiger Storage Engine
• Same data model, query language, & ops
• 100% backwards compatible API
• Non-disruptive upgrade
• Storage savings driven by native compression
• Write performance gains driven by
– Document-level concurrency control
– More efficient use of HW threads
• Much better ability to scale vertically
[Chart: performance of MongoDB 3.0 vs. MongoDB 2.6]
25. MongoDB Sweet Spot Use Cases
Use cases: Big Data; Product & Asset Catalogs; Security & Fraud; Internet of Things; Database-as-a-Service; Mobile Apps; Customer Data Management; Single View; Social & Collaboration; Content Management; Complex Data Management; Embedded / ISV
Example adopters shown: Intelligence Agencies; Top Investment and Retail Banks; Top Global Shipping Company; Top Industrial Equipment Manufacturer; Top Media Company; Cushman & Wakefield
26. CIGNEX Datamatics - Established in 2000, USA
#1 Pure Play Open Source Services Company
12+ Open Source Frameworks/Components
15 Open Source Books Authored
Global Offices: 13+
Business Engagement Platforms: 4+
Open Source Community Contributions: 5000+
Open Source Implementations: 500+
Open Source Consultants: 500+
Portals, Content & Collaboration
Portals
Enterprise Integration
Identity Relationship Management
Enterprise Content Management
Document Management
Web Content Management
Learning/Knowledge Management
Imaging and Scanning - OCR/Digitization
Enterprise Search
Business Process Management
E-Commerce
B2B e-Commerce
B2C e-Commerce
Internet of Things (IoT)
Big Data Analytics
Data Integration
Information Delivery
Data Analysis
Open Source Solutions
Business Engagement Platforms
27. At a Glance – CIGNEX Datamatics Big Data Analytics & IoT Case Studies
Improve performance through real-time
intelligence by efficient device
management & issue identification
GPS Services Company Networking Company
Increase customer satisfaction &
revenue due to uninterrupted video
experience anywhere anytime on any
device
Modernization of legacy Quote Portal
resulting into competitive advantage –
Quote in 5 minutes
Insurance Company
First mover advantage with timely
launch of Sentiment and Trending
Analysis service
SaaS Start-up Company B2B Market Intelligence Services
100% Increase in Conversion Rate with
Single View of Business and Market
Intelligence
E-Learning Community Portal
7x-10x Efficient User Data Management
with Improved application performance
and data security
28. Questions?
Test Drive Big Data Analytics & IoT
Engage us for Proof-of-Concept (PoC)
Website: www.cignex.com | Email: info@cignex.com