Why MarkLogic:
Addressing the Challenges of Unstructured Information
with Purpose-built Technology
Table of Contents

Introduction
Characteristics of Unstructured Information
MarkLogic Addresses Unstructured Information
Summary
About MarkLogic

Abstract
Rapidly changing conditions are forcing organizations to re-think how they
use information to meet their objectives. Whether battling in the marketplace
or on the battlefield, the need for flexibility and agility with information
has never been greater. Organizations are looking to integrate and enrich
information to create additional value for users. User expectations are
changing too, as users demand Web 2.0 and Enterprise 2.0 style applications
that provide modern search capabilities, as well as an ability to interact
with information through tagging and user-generated comments. And various
distribution channels present new challenges for information providers in
exposing their information through rich user interfaces or through
syndicated services like RSS and Atom feeds, allowing users to explore and
access information in their own context.

Choosing the right technology at the core of the application architecture
is critical for any organization that needs the agility to meet these goals
and rapidly respond to unforeseen changes. XML servers such as MarkLogic
Server provide that agility by providing a single unified platform for
storing, manipulating and delivering XML and building innovative
information applications.

This paper provides a technical overview of MarkLogic Server, the industry’s
leading XML server, and also discusses some of the challenges facing
organizations today for storing, repurposing, and dynamically delivering
information.

Introduction
MarkLogic Server is a purpose-built database for unstructured information.
In this context, “unstructured information” refers to all information
that does not fit well in the rows and columns of a relational database
management system (RDBMS). In some cases, unstructured information
might be semi- or even highly structured, but due to specific characteristics
discussed in this paper, it requires significant effort to load, store, and
query in an RDBMS.

Most organizations recognize unstructured information as documents,
such as policies, manuals, contracts, reports, articles, cables, journals,
and legal briefs. Even media such as user-generated content, RSS feeds,
emails, social graphs, metadata, images, videos, and audio files are widely
used forms of unstructured information.

Most existing tools such as RDBMSs were not built to handle the challenges
of unstructured information. These tools either require rigid adherence
to a specific structure or ignore any existing structure altogether. In other
words, they treat unstructured information as a second-class citizen. This
prevents organizations from effectively leveraging their information.




                                                               1   | MarkLogic whitepaper

Characteristics of Unstructured Information
To understand why today’s most common tools are insufficient for leveraging
unstructured information, it is useful to review the specific characteristics
of unstructured information that require it to be treated differently than
structured information. This section discusses these characteristics, while
the next section discusses how MarkLogic addresses them.

Heterogeneous
The first important characteristic of unstructured information is that it is
heterogeneous. In other words, not only does it look different from
structured information, but the many formats of unstructured information
vary significantly from one another. Unstructured information includes
non-discrete data types such as words, sentences, and concepts, in
conjunction with discrete data types such as numbers, dates, and identifiers.
Many combinations of these data types are possible, so standards are created
to maintain manageability. However, the gains are not always clear, since
great variance still exists, as evidenced by the many domain-specific
standards such as:
• FpML – Financial products Markup Language
• OOXML – Office Open XML for Microsoft Office 2007/2010
• ISO 20022 – the ISO Standard for Financial Services Messaging
• XBRL – eXtensible Business Reporting Language
• RixML – Research Information Markup Language
• DocBook – a popular markup language for documentation
• MDDL – Market Data Definition Language
• DDMS – Department of Defense Discovery Metadata Specification

Also consider the different document formats such as PDF, HTML, Microsoft
Office, RTF, etc. These options represent the different ways unstructured
information is stored.

Contrast this heterogeneity with the homogeneity of structured information,
which is stored in a consistent, tabular form. The data types in structured
information primarily consist of numbers, dates, and fixed-length text
strings, which limits its format variation. Database tables were invented
with this limited variation in mind.

Since unstructured information varies greatly, it is not easily stored in
tables. The challenge is that unstructured information must be mapped into
tables and discrete data types, which entails an unnatural and
time-consuming effort. As an alternative, data types such as character/binary
large objects (i.e., CLOBs and BLOBs) of an RDBMS were created to overcome
the limitations of the discrete data types, but they facilitate only storage,
not querying. Therefore, CLOBs/BLOBs are only marginally better than storage
on a filesystem. The problem remains that RDBMSs treat unstructured
information as a second-class citizen. The monolithic approach of CLOBs/BLOBs
ignores the important context in unstructured information, and thus
precludes analysis, retrieval, and updates at a granular level.

Complex
In addition to heterogeneity, unstructured information is also very complex.
Several characteristics contribute to this complexity, and any combination
of them may be found in unstructured information.
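
The earlier point about CLOBs precluding granular access can be illustrated
with a small sketch. The contract markup, element names, and clause text
below are invented for illustration, and the snippet uses only Python's
standard library; it does not represent how any particular database stores
data. Treated as one opaque string, the document supports only substring
checks, whereas parsing the embedded markup lets a single sub-clause be
located and updated.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical contract; element and attribute names are invented for this sketch.
CONTRACT = """
<contract id="c-001">
  <party role="buyer">Acme Corp</party>
  <clause num="1">
    <title>Payment</title>
    <subclause num="1.1">Payment is due within 30 days.</subclause>
    <subclause num="1.2">Late payments accrue interest.</subclause>
  </clause>
</contract>
"""


def clob_contains(blob: str, text: str) -> bool:
    """CLOB-style view: one opaque string, so only substring tests are possible."""
    return text in blob


def find_subclause(root: ET.Element, num: str) -> Optional[ET.Element]:
    """Markup-aware view: locate one sub-clause by its number attribute."""
    return root.find(f".//subclause[@num='{num}']")


root = ET.fromstring(CONTRACT)

# Granular update: only sub-clause 1.1 changes, the rest of the tree is untouched.
target = find_subclause(root, "1.1")
target.text = "Payment is due within 45 days."

print(clob_contains(CONTRACT, "30 days"))  # substring test on the original string
print(find_subclause(root, "1.1").text)    # text of the updated sub-clause
```

The substring test can say that "30 days" occurs somewhere in the document,
but not in which clause; the parsed tree keeps that context available.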




For one, unstructured information is typically hierarchical, with nested
parent/child relationships. Often these relationships are not obvious, but
examples include subsections in a chapter of a book or sub-clauses in a
contract. On the other hand, structured information typically has flat,
tabular relationships that may be expressed as one-to-one, one-to-many, or
many-to-many. Since RDBMSs were not designed for hierarchies, a query to join
rows to recreate the hierarchy is slow and inefficient.

Unstructured information is irregular, meaning it does not fit in neat,
predefined data elements. Information may vary greatly in length, with no
pre-definition or bounded data lengths. It might also be sparsely populated,
meaning that across a collection of information, there might be thousands of
known data elements, many of which are blank. These characteristics are
inconsistent with what RDBMSs expect, in which most columns are expected to
be filled with values.

Finally, unstructured information may or may not conform to a predefined
schema. If it does conform, the schema might be poorly defined, not followed
strictly, or not known in advance. Even in the case of predefined schemas,
large variances may be allowed, making each item appear very different from
the next. RDBMSs expect rigid, predefined schemas with predefined data
elements, so unstructured information is a poor fit.

While some organizations try to map unstructured information into rows and
columns, they face huge tradeoffs. Either data accessibility is compromised,
or the system takes a significant performance hit due to inefficient storage
and indexing.

Changing in Unpredictable Ways
When unstructured information evolves, it changes in unpredictable and
unannounced ways. New standards, new sources, and new applications are
created continually. And there are generally no restrictions on how it is
updated. Take an example such as a contract. If an attorney amends a contract
to revise terms, she updates it in any way she desires without formatting
restrictions. She is not limited by the number of words or sentences, or even
by the location of the amended text. She typically uses a word processing
program like Microsoft Word to make updates, and the user interface does not
have hard rules on how the contract should be changed. There also is no
preparation required by IT staff to plan for the changes, as the attorney
makes the changes ad hoc.

Contrast this with structured information, which changes in well-known ways.
For example, each value in an RDBMS changes in an expected way: numbers are
increased or decreased, dates are modified with other dates, and text strings
are updated within predefined lengths. And when the schema changes, the
system is first updated to accommodate that change. Schema changes must be
announced before they can be handled by the system. The IT staff necessarily
knows what type of changes will be made by users to structured information
before the changes can be made. RDBMSs are good for predictable and announced
changes, but are not efficient for the changes that unstructured information
undergoes.

Text-Centric
Unstructured information is heavily text-centric. It contains language
ambiguities
typically not clear for processing by computers. For example, a word such as
“foot” can have several different meanings including a body part, the bottom
of something, or 12 inches. The definition is dependent on the context.
Without proper context, users may encounter many false positives, in which
they retrieve irrelevant information. They may also encounter many false
negatives, in which they miss relevant information described using different
terminology.

Also, text within unstructured information lacks specific identifiers to help
define various data elements. In comparison, column names such as
“first_name” in an RDBMS table leave no ambiguity about the meaning of the
data values. While human readers can easily find names in unstructured
information such as in a contract, it is far less obvious when processed by a
computer. Since RDBMSs were designed for tabular data, they do not have the
functionality to properly handle the text-centric nature of unstructured
information.

Exponentially Growing
Analysts estimate unstructured information grows 10 to 50 times faster than
structured information. Information in general continues to grow at a
tremendous rate, with one estimate at 800% over the next five years. This
rapid growth of unstructured information requires new approaches and
strategies pertaining to performance and scalability. Though hardware
advancements help with scaling, those are only part of the solution. Software
must be optimized with modern hardware in mind to maximize efficiency.
Organizations that rely on older technologies must choose between excessive
expenditures and insufficient functionality when facing today’s unstructured
information loads.

MarkLogic Addresses Unstructured Information
Based on the characteristics of unstructured information in the previous
section, it is clear that today’s most popular technologies are not able to
fully leverage unstructured information. RDBMSs lack the flexibility to
efficiently handle unstructured information, and search engines lack the
management and update capabilities that applications require. Content
management systems, which are largely workflow-oriented applications built on
RDBMSs and search engines, suffer the same challenges because of the
limitations of the underlying platform.

Despite this, many organizations still try to use their current tools with
limited success. But now organizations no longer have to compromise. Since
MarkLogic was designed for leveraging unstructured information, it has
important features that lead to significant benefits. Some of those key
features are described below.

Universal Index
MarkLogic’s Universal Index is a key feature for addressing the heterogeneity
of unstructured information. It captures all information users need for
precise, high-performance queries. Application development teams spend less
time on data modeling, re-modeling, and performance tuning, thus expediting
time-to-market and lowering total cost of ownership. Unstructured information
wants to be unrestricted, and the Universal Index allows that.

The Universal Index allows users to query all information that the system
sees, rather than only the information the system is told to see. In other
words, the Universal Index enables MarkLogic to make no presumptions around
what
information should be expected and enables the system to store information
“as is” without requiring time-consuming data modeling to standardize
disparate information formats. This is also referred to as being
“schema-agnostic” or “schema-permissive”, in which any schema, or even
non-existent schemas, can be loaded into MarkLogic with no prior planning. It
automatically captures all elements in information, including words,
structure, dates, and numbers. This means no information is lost, and all
elements can be queried and retrieved.

In addition to effectively handling heterogeneous information, the Universal
Index also addresses the complexity of unstructured information due to
hierarchy, irregularity, and poor schema definition. It also provides the
flexibility to accommodate the wide variety of changes end users make with
their information.

XML Documents as the Data Model
To properly handle the complexity of unstructured information, MarkLogic uses
a data model based on XML documents, which is more efficient and effective
for storing unstructured information than the relational model. Support for
W3C-standard XSLT and XQuery, both purpose-built for XML, enables fast and
easy querying and transformation. MarkLogic customers have experienced
significant improvements in agility and efficiency by eliminating the
resource drain of trying to model and store unstructured information in an
RDBMS.

An XML data model gives MarkLogic several important advantages for leveraging
unstructured information. First, embedded markup in XML creates context to
enable granularity for access, updates, reuse, and repurposing. Second, XML
is extensible, so new data elements can be added ad hoc without having to
redesign a schema. Third, XML has the flexibility to fully capture and model
the unpredictable and irregular aspects of unstructured information,
including non-discrete data elements, hierarchical elements, variable-length
characters, and sparseness of data.

Using XML documents as the data model was a natural architectural decision
for MarkLogic Server. XML is ideal for fully exploiting unstructured
information despite the heterogeneity, complexity, and unpredictable change.
MarkLogic’s use of XML ensures it can handle current and future requirements
around unstructured information.

Transaction Controller
Delays in access to information are often due to limitations in technology.
With unpredictable changes in unstructured information (including those
pertaining to standards, formats, and content), the potential for delay is
increased. MarkLogic Server was designed to immediately accommodate those
types of changes, thus eliminating the latency found in structured
technologies. As mentioned earlier, MarkLogic’s Universal Index and XML data
model provide the flexibility to offset the design overhead for new
information types.

Those features represent only part of the real-time access capability.
MarkLogic’s ACID (atomicity, consistency, isolation, durability) transaction
controller ensures newly inserted information is indexed in real time and
available to users immediately. Its multi-version concurrency control (MVCC)
ensures rapid insertion with minimal resource contention. Indexing can be
done simultaneously with heavy query loads with no blocking so
organizations do not have to settle for delayed information access. And for
the most time-sensitive information, MarkLogic’s real-time alerting quickly
and efficiently processes millions or billions of queries against a fast
incoming feed of new information.

Search and Analytics Capabilities
Resolving language ambiguities is an important requirement in handling
text-centric unstructured information. MarkLogic Server helps in two ways to
let end users find and make sense of the information they have. First, it
provides features to make information clearer. Second, it provides several
techniques for finding evidence as the basis for relevance.

To make information clearer, MarkLogic helps with the identification of
meaning and context in information. For example, integration with entity
enrichment tools enables identification of entities such as people, places,
and things. Range indexes provide structure around specific values to enable
precise and fast retrievals, as well as sorting, aggregations, and lookups.
Support for extensible metadata schemas allows adding any type of identifying
data to existing documents.

To improve relevance in searches, MarkLogic Server provides capabilities
found in leading enterprise search engines such as phrase, proximity, and
thesaurus searches. In addition, MarkLogic supports highly tunable relevance
ranking to more precisely match the end user’s needs. The Universal Index
captures all components of information to enable a higher level of
specificity, granularity, and structure in searches. Range indexes enable
classification and faceted navigation, to help organize information in
meaningful and structured ways for faster discovery by end users. Geospatial
searching enables location-based information retrieval. And finally, built-in
co-occurrence analysis reveals hidden relationships between various entities
in a collection of information.

Shared Nothing Architecture
MarkLogic’s shared nothing architecture allows high performance and massive
scalability to address the unanticipated growth of unstructured information.
MarkLogic is optimized for commodity hardware, and exhibits linear scaling to
easily and efficiently grow to handle future needs. As the user or
information load increases, performance and response times can be maintained
by adding servers to a cluster.

MarkLogic has been deployed in clusters of over 100 hardware servers, with
expectations of customers moving well beyond that in the near future. Not
only do customers gain cost savings by leveraging commodity hardware, and
less of it, but the lower administrative overhead has resulted in the ability
to reallocate human resources to higher value activities. At one customer
site, only one-half of a full-time equivalent is required to administer the
100-server MarkLogic cluster.

Summary
The focus on unstructured information has increased over the years, but the
ubiquity of RDBMSs has misled many organizations into making tradeoffs around
functionality, time-to-market, total costs, and performance. Since RDBMSs
were designed for structured information, which is greatly different from
unstructured information, there is a clear mismatch that leads to costly
inefficiencies.

With its Universal Index, XML data model, transaction controller, search and
analytics capabilities, and shared-nothing
architecture, MarkLogic is the right choice
for tackling the challenges of unstructured
information. Customers report significant
gains with MarkLogic Server, including
performance improvements of 10 to 100 times,
time-to-market in weeks instead of years, and
scaling to hundreds of terabytes today
and petabytes tomorrow.
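
As a rough illustration of the range index and faceted navigation ideas
named under search and analytics, the sketch below keeps values sorted so
that range lookups and per-value counts are cheap. It is a toy in plain
Python, not MarkLogic's API or implementation; the class name RangeIndex,
its methods, and the sample document names are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


class RangeIndex:
    """Toy range index (hypothetical): (value, doc_id) pairs sorted by value."""

    def __init__(self, pairs):
        # Sorting once lets later range lookups use binary search.
        self._entries = sorted(pairs)
        self._values = [value for value, _ in self._entries]

    def between(self, low, high):
        """Return doc ids whose indexed value lies in [low, high]."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return [doc_id for _, doc_id in self._entries[start:end]]

    def facets(self):
        """Count documents per distinct value, as a facet list would show."""
        return Counter(self._values)


# Hypothetical sample data: publication year per document.
years = RangeIndex([
    (2008, "a.xml"),
    (2010, "b.xml"),
    (2009, "c.xml"),
    (2010, "d.xml"),
])
```

Calling years.between(2009, 2010) returns the documents in that range in
value order, and years.facets() gives the counts a user interface could
display next to each year.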

About MarkLogic
MarkLogic Corporation is revolutionizing
the way organizations leverage information.
Our flagship product is a purpose-built
database for unstructured information. Based on
patented innovations, MarkLogic Server
enables customers in industries including
media, government and financial services
to develop and deploy information
applications at a fraction of the time and cost
it takes with conventional approaches.

The company is led by pioneers in search
engine technologies, database management
systems, and business intelligence software.
Our founder saw that the traditional ways of
managing and delivering information using
relational databases and search engines
were no longer sufficient. The increasing
volume and variety of information that
enterprises need to leverage required a
radically new approach.




MarkLogic Corporation
www.marklogic.com
sales@marklogic.com
+1 877 992 8885
Headquarters
999 Skyway Road, Suite 200
San Carlos, CA 94070







    © Copyright 2010 MarkLogic Corporation. MarkLogic is a registered trademark and MarkLogic Server is a trademark of MarkLogic Corporation, all
    rights reserved. All other product names mentioned herein are the property of their respective owners.
