Many organizations today, whether for regulatory compliance or other reasons, need to archive large volumes of data in long-term storage. Learn how MongoDB provides flexible, efficient, scalable long-term document storage that can adapt to your organization's changing needs over time. Includes a case study from a US federal government agency with 130 legacy applications that needed to be archived and integrated into a federated view of archived and real-time operational data. Regulations in many industries (e.g. HIPAA, SOX, Basel III, FATCA) are driving the need for data retention and for query processing across archives and operational data.
Complex Legacy System Archiving/Data Retention with MongoDB and XQuery
1. Legacy System Archiving With XML, XQuery and MongoDB
Dave Watson
SVP, iWay Software
@watsondaveny
watson.dave@gmail.com
2. Agenda
XML Archive Overview and Business Use Cases
XML Archive Technical Discussion
Copyright 2009, Information Builders. Slide 2
3. iWay Archive
What is XML Archive
An extension of ESB for archiving data
Leverage ESB process-oriented integration and data
federation capabilities
Long term data retention
Large repository, large index (Big Data)
Search and retrieve capabilities (High performance)
Business use examples
Satisfy regulatory requirement
e-Discovery (e.g. research, forensic)
Business analytics
4. Archive – Solving Business Needs
Examples of Business Requirements:
Regulation / Requirement               Example                            Data Retention
Federal Record Retention Requirement   Patient health records             75 years (after last episode of care)
FDA 21 CFR Part 11                     Clinical trials and FDA approval   35 years
HIPAA (Healthcare)                     Pediatric medical records          21 years
Sarbanes-Oxley (public companies)      Audit                              7 years
SEC 17a-4 (Financial services)         Account records                    6 years
                                       Corporate documentation            Life of the enterprise
Research                               Life science                       Long-term
Analytics                              Financial / Legal                  Long-term
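As a rough illustration of how the retention periods listed above translate into a discard date, the following Python sketch computes the end of a retention window. The policy keys and function names are hypothetical and are not part of the archive product; only the year counts come from the slide.

```python
from datetime import date

# Retention periods in years, taken from the slide; the keys are made-up labels.
RETENTION_YEARS = {
    "federal_record_patient_health": 75,  # counted from the last episode of care
    "fda_21_cfr_part_11": 35,
    "hipaa_pediatric": 21,
    "sarbanes_oxley_audit": 7,
    "sec_17a_4_account_records": 6,
}

def retention_end(start: date, years: int) -> date:
    """Return the date `years` after `start` (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)

def may_discard(policy: str, start: date, today: date) -> bool:
    """True once the retention window for `policy` has fully elapsed."""
    return today >= retention_end(start, RETENTION_YEARS[policy])

last_episode = date(2012, 3, 15)
print(retention_end(last_episode, RETENTION_YEARS["federal_record_patient_health"]))
```

Open-ended entries such as "Life of the enterprise" or "Long-term" have no fixed end date and would need a separate rule.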
5. Archive – Types of Data
Can handle all types of data, for example:
Electronic Documents
Word, Excel, EDI, HL7, XML, …
Applications
ERPs, CRMs, SAP, SFDC, …
Database Data
IMS, DB2, Oracle, Sybase, SQL Server, MUMPS, …
Electronic Files
VSAM, Unix, Logs, …
Email
Outlook, Lotus Notes
Others
Multimedia files, Paper, Blueprints, Forms, Claims, …
ESB adapter components can be used to connect to the different types of
data.
6. Archive – Archiving Needs
Examples of Archiving Requirements:
Archive Requirements
Policy Based – Logical selection of DB records/transactions to be archived
Store very large amounts of data in archive
Keep data for very long periods of time
Become independent from Applications/DBMS/Systems – future proof
Protect authenticity of data – regulation and compliance
Access archived data when needed / as needed
Quickly search huge numbers of archived documents
Discard data after retention period – regulation and compliance
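Two of the requirements above, policy-based selection and protecting authenticity, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `closed_on` field, the cutoff rule, and the use of a SHA-256 digest are hypothetical choices, not details given in the slides.

```python
import hashlib
from datetime import date

def select_for_archive(records, cutoff):
    """Hypothetical policy: archive records closed before the cutoff date."""
    return [r for r in records if r["closed_on"] < cutoff]

def fingerprint(payload: bytes) -> str:
    """SHA-256 digest recorded when the document enters the archive."""
    return hashlib.sha256(payload).hexdigest()

def is_authentic(payload: bytes, recorded_digest: str) -> bool:
    """Re-hash on retrieval and compare with the digest recorded at load time."""
    return fingerprint(payload) == recorded_digest

records = [
    {"id": 1, "closed_on": date(2001, 5, 1), "body": b"<claim>old</claim>"},
    {"id": 2, "closed_on": date(2013, 2, 1), "body": b"<claim>recent</claim>"},
]
to_archive = select_for_archive(records, cutoff=date(2010, 1, 1))
digests = {r["id"]: fingerprint(r["body"]) for r in to_archive}
```

Storing the digest alongside the archived document lets a later audit detect any change to the stored bytes.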
7. Archive – Example Business Use Case
Store 75 years worth of patient data
Diverse Sources
XML
MUMPS
Oracle
HL7
Support archive, query and integration scenarios
XML to remain unchanged and exist outside the data store
Ability to query documents
Ability to retrieve original XML or part of XML using XQuery
Ability to integrate XML archived data in federated services
with operational sources (e.g. MUMPS, HL7, Oracle)
8. Archive – Example Business Requirements
Highly scalable, high-performance document
management database
Easily integrates into an ESB architecture
Multi-threaded parallel processing
Distributed processing
Just another data source along with, e.g., Oracle and
MUMPS databases
Leverage ESB Tools for process orchestration,
process monitoring, data mapping/transformation,
security and data aggregation capabilities.
Implementation and vendor neutral – archived data (e.g.
XML) stored in the operating system's native file system
10. Overview
Highly configurable ESB Java application that can be
customized to specific needs.
Load Channel
Reads XML documents and loads them into the
document repository.
Query Channel
Handles query requests and responses against the
document repository.
Test Channel
Simple visual interface displaying functionality and
usage of the Query API.
11. Technology Involved
ESB -
iWay Service Manager (commercial)
IBM WebSphere ESB (commercial)
Oracle Service Bus (commercial)
WSO2 ESB (open source)
mongoDB - http://www.mongodb.org/
JSON - JavaScript Object Notation
XQuery - XML query language
12. mongoDB
“Humongous”
Scalable, high-performance, document-oriented database.
JSON-style documents.
Mirror capable.
Auto-sharding (clustering), horizontal scaling, automatic
failover, no single point of failure.
MapReduce support for complex processing. Work is
distributed among the cluster.
GridFS support.
A distributed file system.
Commercial support from 10gen (OEM by iWay Software)
13. XQuery
A query and functional programming language for XML
documents.
Is to XML documents what SQL is to databases.
“FLWOR” expressions.
FOR, LET, WHERE, ORDER BY, RETURN
Example:
for $x in /FEDREG/CNTNTS/AGCY
where $x/EAR = "Agricultural"
order by $x ascending
return $x
Supports syntax for constructing new documents.
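To make the FLWOR clauses concrete, here is a minimal Python sketch that mimics the example's for / where / order by / return steps using only the standard library. The sample document, including the `HD` element, is invented for illustration; Python's plain string comparison stands in for XQuery's default collation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the paths used in the example.
SAMPLE = """
<FEDREG>
  <CNTNTS>
    <AGCY><EAR>Agricultural</EAR><HD>Marketing Service</HD></AGCY>
    <AGCY><EAR>Commerce</EAR><HD>Census Bureau</HD></AGCY>
    <AGCY><EAR>Agricultural</EAR><HD>Forest Service</HD></AGCY>
  </CNTNTS>
</FEDREG>
"""

root = ET.fromstring(SAMPLE)

# for $x in /FEDREG/CNTNTS/AGCY
candidates = root.findall("./CNTNTS/AGCY")

# where $x/EAR = "Agricultural"
matches = [x for x in candidates if x.findtext("EAR") == "Agricultural"]

# order by $x ascending (sort on the concatenated text of each element)
matches.sort(key=lambda x: "".join(x.itertext()))

# return $x
result = [ET.tostring(x, encoding="unicode").strip() for x in matches]
for item in result:
    print(item)
```

A real XQuery engine evaluates the whole expression declaratively; the step-by-step version above only shows what each clause contributes.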
14. JSON – JavaScript Object Notation
The new data-interchange language of the web.
www.json.org
15. Base Loading Architecture
ESB: Listener -> Flow
Flow steps: XML to JSON, Store JSON, Store XML
Storage: mongoDB (JSON), GridFS Binary Storage (XML)
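The three flow steps (XML to JSON, Store JSON, Store XML) can be sketched as follows. This is not the product's implementation: the attribute prefix `@`, the `#text` key, and the two in-memory dictionaries standing in for a mongoDB collection and GridFS are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an element to a JSON-friendly dict.

    Assumed mapping: attributes get an "@" prefix, text goes under "#text",
    and repeated child tags are collected into a list.
    """
    node = {"@" + k: v for k, v in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            existing = node[child.tag]
            if not isinstance(existing, list):
                node[child.tag] = existing = [existing]
            existing.append(value)
        else:
            node[child.tag] = value
    return node

json_store = {}    # stand-in for a mongoDB collection
binary_store = {}  # stand-in for GridFS binary storage

def load(doc_id, xml_text):
    root = ET.fromstring(xml_text)                        # XML to JSON
    json_store[doc_id] = {"_id": doc_id, root.tag: element_to_dict(root)}  # Store JSON
    binary_store[doc_id] = xml_text.encode("utf-8")       # Store XML, unchanged

load("doc-1", '<a name="bob"><b>hello</b></a>')
print(json.dumps(json_store["doc-1"]))
```

Keeping the original XML bytes untouched next to the JSON form means queries run against the JSON while retrieval can always return the source document.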
16. Base Query Architecture
HTTP Requester -> ESB: Listener -> Flow
Flow steps: Query DB, (Optional) Get XML
Storage: mongoDB, GridFS Binary Storage
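The two-step query flow (query the database, then optionally fetch the original XML) might be sketched like this. The in-memory dictionaries and the function names are hypothetical stand-ins for mongoDB and GridFS, used only to show the shape of the flow.

```python
# Stand-ins for the JSON index (mongoDB) and the stored XML (GridFS).
json_index = {
    "doc-1": {"a": {"@name": "bob"}},
    "doc-2": {"a": {"@name": "alice"}},
}
xml_store = {
    "doc-1": b'<a name="bob"><b>hello</b></a>',
    "doc-2": b'<a name="alice"><b>bye</b></a>',
}

def query_db(predicate):
    """Step 1, Query DB: ids of indexed documents that satisfy the predicate."""
    return [doc_id for doc_id, doc in json_index.items() if predicate(doc)]

def get_xml(doc_id):
    """Step 2 (optional), Get XML: fetch the original document bytes."""
    return xml_store[doc_id].decode("utf-8")

def handle_request(predicate, include_xml=False):
    response = []
    for doc_id in query_db(predicate):
        entry = {"id": doc_id}
        if include_xml:
            entry["xml"] = get_xml(doc_id)
        response.append(entry)
    return response

ids_only = handle_request(lambda d: d["a"]["@name"] == "bob")
with_xml = handle_request(lambda d: d["a"]["@name"] == "bob", include_xml=True)
```

Making the XML fetch optional keeps simple lookups cheap, since only the index is touched unless the caller asks for the full document.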
17. Loading Modification
External Storage
ESB: Listener -> Flow
Flow steps: XML to JSON, Store JSON, Store XML
Storage: mongoDB (JSON), File System (XML)
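With external storage, the XML lands in the native file system rather than in GridFS. A minimal sketch of that variation, assuming the index document keeps a path back to the file (the `xml_path` field and the file naming are invented for this example):

```python
import tempfile
from pathlib import Path

def store_external(archive_dir: Path, doc_id: str, xml_text: str, index: dict) -> Path:
    """Write the XML unchanged to the file system and record its path in the index."""
    path = archive_dir / f"{doc_id}.xml"
    path.write_bytes(xml_text.encode("utf-8"))
    index[doc_id] = {"_id": doc_id, "xml_path": str(path)}  # hypothetical field name
    return path

index = {}  # stand-in for the mongoDB collection holding the JSON side
with tempfile.TemporaryDirectory() as tmp:
    original = '<a name="bob"><b>hello</b></a>'
    path = store_external(Path(tmp), "doc-1", original, index)
    round_trip = path.read_text(encoding="utf-8")
```

Because the file is plain XML on disk, it stays readable even without the database or the ESB.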
18. Loading Modification
SAP Loading Architecture
SAP System -> RFC Server -> ESB Flow
Flow steps: IDOC to XML, XML to JSON, Store JSON, Store XML, Store IDOC
Storage: mongoDB, GridFS Binary Storage
19. Loading Modification
Change Data Capture Loading Architecture
RDBMS -> CDC Listener -> ESB Flow
Flow steps: XML to JSON, Store JSON, Store XML
Storage: mongoDB, GridFS Binary Storage
20. Loading Modification
Salesforce.com Loading Architecture
Salesforce System -> SOAP Listener -> ESB Flow
Flow steps: XML to JSON, Store JSON, Store XML
Storage: mongoDB, GridFS Binary Storage
21. Loading Modification
FTP Loading Architecture
File System -> FTP Server -> ESB Flow
Flow steps: XML to JSON, Store JSON, Store XML
Storage: mongoDB, GridFS Binary Storage
22. Query Modification
Web Service SOAP Query Architecture
Web Service Client -> SOAP Listener -> ESB Flow
Flow steps: Query DB, (Optional) Get XML/IDOC
Storage: mongoDB, GridFS Binary Storage
23. The Test Client
Note: The archive is designed to be called from other
flows or programs.
A simple AJAX based human interface for querying the XML
Archive.
Provides examples of the HTTP query interface provided by
the base XML Archive.
Installed with the base implementation of the XML Archive.
26. Basic Query
Return all documents in which the name attribute of
the <a> element equals “bob”.
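One way such a query could be expressed is as a dotted-path equality filter over the JSON form of each document. The sketch below assumes attributes are stored under an `@`-prefixed key (so the filter key is `a.@name`) and uses a small in-memory matcher rather than a live database; both are assumptions, not the archive's documented query syntax.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as "a.@name" through nested dicts."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def find(collection, criteria):
    """Return documents where every dotted path equals the requested value."""
    return [
        d for d in collection
        if all(get_path(d, path) == wanted for path, wanted in criteria.items())
    ]

collection = [
    {"_id": 1, "a": {"@name": "bob", "b": "first"}},
    {"_id": 2, "a": {"@name": "alice", "b": "second"}},
    {"_id": 3, "a": {"@name": "bob", "b": "third"}},
]

hits = find(collection, {"a.@name": "bob"})
print([d["_id"] for d in hits])
```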
27. Advanced Queries
Query handler is a wrapper around the mongoDB
query language.
Support for:
And
Or
Regular Expressions
Ranges
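The four kinds of condition listed above correspond to MongoDB query operators such as `$and`, `$or`, `$regex`, and the range comparators `$gt`, `$gte`, `$lt`, `$lte`. The following Python sketch evaluates a small subset of that query language in memory, purely to show how the pieces combine; the sample documents and the `matches` helper are hypothetical, and the wrapper's actual request format is not shown in the slides.

```python
import re

def get_path(doc, dotted):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

COMPARATORS = {
    "$gt": lambda v, x: v > x,
    "$gte": lambda v, x: v >= x,
    "$lt": lambda v, x: v < x,
    "$lte": lambda v, x: v <= x,
}

def matches(doc, query):
    """Evaluate a small subset of MongoDB-style queries against one document."""
    for key, cond in query.items():
        if key == "$and":
            if not all(matches(doc, q) for q in cond):
                return False
        elif key == "$or":
            if not any(matches(doc, q) for q in cond):
                return False
        else:
            value = get_path(doc, key)
            if isinstance(cond, dict):  # operator document, e.g. {"$gte": 40}
                for op, arg in cond.items():
                    if op == "$regex":
                        ok = isinstance(value, str) and re.search(arg, value) is not None
                    else:
                        ok = value is not None and COMPARATORS[op](value, arg)
                    if not ok:
                        return False
            elif value != cond:  # plain equality
                return False
    return True

docs = [
    {"patient": {"name": "Bob Smith", "age": 42}},
    {"patient": {"name": "Bobby Jones", "age": 70}},
    {"patient": {"name": "Alice Ray", "age": 50}},
]

# And + regular expression + range
q_and = {"$and": [{"patient.name": {"$regex": "^Bob"}},
                  {"patient.age": {"$gte": 40, "$lt": 65}}]}
# Or
q_or = {"$or": [{"patient.name": "Alice Ray"}, {"patient.age": {"$gt": 65}}]}

and_hits = [d["patient"]["name"] for d in docs if matches(d, q_and)]
or_hits = [d["patient"]["name"] for d in docs if matches(d, q_or)]
```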
28. Basic XQuery
Return only the <b> element from the document.
Formatted Result:
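As a rough illustration of this kind of query and its output, the Python sketch below extracts only the `<b>` element from a hypothetical document, which is approximately what the XQuery path `/a/b` would return. The sample document is invented, and the standard-library call here stands in for a real XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical archived document.
doc = ET.fromstring('<a name="bob"><b>hello</b><c>other</c></a>')

# Roughly equivalent to the XQuery path /a/b: select the <b> child of the root.
b = doc.find("b")
result = ET.tostring(b, encoding="unicode")
print(result)
```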