SlideShare uma empresa Scribd logo
1 de 21
Baixar para ler offline
Second Quarter 2012


Headline
subhead
Various permutations of text (word processing files, simple text files, emails etc.) make up the largest amount of unstructured data cur-
rently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful informa-
tion from the immensity of corporate email.

Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstruc-
tured text documents. Most of these systems leverage metadata to provide an extra layer of classification allowing for easier searches
and enhanced reporting.




Unstructured Data
and the Enterprise
by Paul Williams
Collaborated with Christine Connors




                                                     ™




                                                                                                        dataversity.net              1
table of contents
Executive Summary .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 1

The Growing Industry around Unstructured Data .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 2

	              Current Usage of Unstructured Data in the Enterprise  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 2
                                                                   . .

Unstructured Data Formats .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 5

	              Text Documents.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 5

	              Web Pages  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 5
                       . .

	              Media Formats (Audio, Video, Images)  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 5
                                                  . .

	              Office Software Data Formats (PowerPoint, MS Project).  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 5

Capturing and Managing Unstructured Data .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 6

	Data Scraping and Text Parsing.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 6

	Metadata.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 7

	Taxonomy .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 9

	              eDiscovery (electronic discovery) .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 10

	Discovery Systems  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 11
                 . .

	              Text Analytics.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 11

Insert By Christine Connors, Principal, TriviumRLG LLC  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 13
                                                      . .

	              Integrating Unstructured Data and Putting it to Use in your Real World.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 13

	              Requirements for Unstructured Data Projects (Christine Connors, Principal, TriviumRLG LLC) .  .  .  .  .  .  .  .  . 14

Summing Up the Opportunities Created by Unstructured Data.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 15

Appendix .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 16

	              Open Source and Commercial Applications around Unstructured Data.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 16

	              Taxonomy Providers .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 16

	              Content Management Systems.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 17

	Discovery Systems  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 17
                 . .

	Metadata.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 18

	About Dataversity.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 18


                                                                                                                                                                                   dataversity.net                                  1
™




Executive Summary
In its most basic definition, unstructured data simply means any form of data that does not easily fit into a relational model or a set of
database tables. Unstructured data exists in a variety of formats: books, audio, video, or even a collection of documents. In fact, some
of this data may very well contain a measure of structure, such as chapters within a novel or the markup on a HTML Web page, but
not a full data model typical of relational databases.

➺ Anywhere from 40 to 80 percent of an enterprise’s stored data currently resides in an unstructured format. While most
   unstructured data is in various text formats, other formats include audio, video, Web pages, and office software data.

➺ Firms able to successfully capture and manage their unstructured data hold a competitive advan-
  
   tage over firms unable to do the same. Over 90 percent of enterprises are currently planning to manage unstructured data or are
   already doing it.

➺ Metadata markup, text analytics, data mining, and taxonomy creation are all industry-proven tech-
  
   niques used to capture an enterprise’s unstructured data. Included case studies show these techniques in action as part of success-
   ful problem-solving projects.

➺ Difficulty in integration with enterprise systems along with the immaturity of the currently available unstructured data
  
   management tools are two major barriers to the successful management of unstructured data.

Many industry pundits claim that 80 to 85 percent of all enterprise data resides in an unstructured format. Some studies counter that
statistic, arguing the true percentage is significantly less. DATAVERSITY’s own 2012 survey reflects a percentage below the common-
ly stated 80 percent. No matter the actual percentage compared to structured formats, there is little doubt the amount of unstructured
data continues to grow.

Gartner Research, one organization quoting the 80 percent metric for unstructured data, predicts a nearly 800 percent growth in the
amount of enterprise data over the next 5 years. The large majority of this data is expected to be unstructured. This data growth is one
factor driving corporate investments in Big Data, Cloud Computing, and NoSQL.




                                                                                                          dataversity.net             1
The Growing Industry
around Unstructured Data
A large industry has grown around the task of deriving valuable business information
out of unstructured data. With parsers scraping information out of pages and pages of
                                                                                           What industry
text, in addition to full systems built around data taxonomy and discovery, many op-       are you in?
tions exist for any enterprise trying to make sense of their unstructured information.                                         0%     5%     10%    15%   20%


                                                                                           TECHNOLOGY/COMPUTERS/SOFTWARE
Commercially available Content Management Systems (CMS) use metadata to pro-                                   GOVERNMENT

vide more accurate searching of an organization’s document library. In fact, metadata
                                                                                                     OTHER (PLEASE SPECIFY)
                                                                                                     IT CONSULTING SERVICES
remains a very important tool in the handling of any unstructured data. Business Intel-                           EDUCATION

ligence system providers also offer mature solutions around making sense of the vast
                                                                                              SOFTWARE/SYSTEM DEVELOPMENT
                                                                                                                HEALTHCARE
arrays of an enterprise’s seemingly unrelated information.                                             TELECOMMUNICATIONS
                                                                                                          FINANCIAL SERVICES
                                                                                                                  INSURANCE
Despite any associated costs, enterprises better able to leverage meaningful Business                        MANUFACTURING

Intelligence from unstructured data gain a competitive advantage over those compa-
                                                                                             RETAIL/WHOLESALE/DISTRIBUTION
                                                                                                             ENTERTAINMENT
nies who cannot.                                                                                                      MEDIA
                                                                                                                 MARKETING
                                                                                                                 PUBLISHING
Once that data is successfully under management, finding the best techniques and ap-
plications for leveraging the data remains key in deriving a competitive advantage for                                                           Graph #1
any organization.



Current Usage of Unstructured                                                              number of employees
Data in the Enterprise                                                                     in your company?
DATAVERSITY recently sent out a survey to its readership base covering a host of
                                                                                                           0%       5%          10%        15%     20%    25%


topics related to unstructured data in the enterprise. With nearly 400 respondents, the    LESS THAN 10

answers provide a reliable cross section of industry types and organizational sizes.             10-100
                                                                                              100-1,000

The survey results provide a unique insight as to how organizations perceive their
                                                                                             1,000-5,000
                                                                                            5,000-10,000
unstructured data problem, what steps they are taking to derive meaningful business        10,000-50,000
information from that data, what formats the data resides in, and finally some demo-        OVER 50,000

graphic information about their job function, organization, and industry.
                                                                                                                                                 Graph #2
Referencing the organizational demographics of the survey respondents provides a
sense of the types and sizes of the companies attempting to manage unstructured data.
In short, a wide cross section of organizational sizes and industries are currently man-
aging or considering implementing projects to manage unstructured data. See Graphs
One and Two below:




                                                                                                           dataversity.net                           2
Documents are the unstructured data type most commonly under management or considered for management, followed closely by
emails, presentations, and scanned documents. Various media formats (images, audio, and video) and social media chatter are also
important. See Graph Three below:


What Types of Unstructured Data Does Your Organization
Currently Manage, or is Considering for Management?

                                  0%                      20%                      40%         60%           80%              100%

                   DOCUMENTS
                         EMAIL
SCANNED DOCUMENTS OR IMAGES
                 PRESENTATIONS
         SOCIAL MEDIA CHATTER
            DRAWING/GRAPHICS
                         VIDEO
                         AUDIO
                 PHOTOGRAPHS
             GEOSPATIAL IMAGES
     NONE - WE HAVE NO PLANS
         OTHER (PLEASE SPECIFY)                                                                                      Graph #3




Most organizations hope to improve business efficiencies and reduce costs at their organization through the successful management
of unstructured data. Additional potential benefits from unstructured data management include the improvement of communication,
deriving sales leads, meeting compliance requirements, improving data governance, improving enterprise search, parsing and analyz-
ing social media channels, and Business Intelligence system integration.

The main barriers to improving unstructured data at the enterprise reflect the growing maturity of the practice. Two major barriers to
the successful management of unstructured data are difficulty in integration with enterprise systems along with the immaturity of the
currently available unstructured data management tools. Cost, lack of executive support, and a lack of unstructured data knowledge
among IT staff are also major barriers. See Graph Four on the following page.


what are the main barriers to effectively improving the management of
unstructured data within your company?

                                                                              0%         10%    20%   30%           40%              50%


  DIFFICULTY OF INTEGRATING UNSTRUCTURED DATA WITH OTHER ENTERPRISE SYSTEMS
                 LACK OF MATURE TOOLS/SYSTEMS FOR ENTERPRISE MANAGEMENT
                                        REQUIREMENTS ARE NOT WELL DEFINED
                  SHORTAGE OF SKILLS/KNOWLEDGE AMONGST EXISTING IT STAFF
                                                  COMPLEXITY OF SOLUTIONS
                      HIGH COST OF AVAILABLE TECHNOLOGIES AND SOLUTIONS
                                           TOO MUCH TO DO, TOO LITTLE TIME
                              LACK OF EXECUTIVES OR BUSINESS-LEVEL SUPPORT
         BUSINESS CASE IS NOT UNDERSTOOD, OR NOT SUFFICIENTLY COMPELLING
                                                 LACK OF AUTOMATED TOOLS
                            LACK OF GOVERNANCE SUPPORT IN TOOLS/SYSTEMS
                                                                                                                          Graph #4




                                                                                                            dataversity.net            3
System integration with existing structured data appears to be the key technology challenge of the unstructured data management
process across most industries (see Graphic Five below). Determining project requirements is another key challenge, especially in the
insurance and entertainment industries, as well as improving semantic consistency between multiple systems. Other relevant chal-
lenges include security, taxonomy development, scalability, and search result integration across multiple platforms.



What are the key technology challenges you have encountered when
attempting to build systems or applications for managing
Unstructured Data?




                                                                                                                                                                                                                                                              IT Consulting Services
                                                                                                                                          Retail / Wholesale /
                                                                   Computers / Software



                                                                                           Telecommunications




                                                                                                                                                                                                                                                                                       Software / System
                                                                                                                                                                                   Financial Services
                                                                                                                                                                  Manufacturing




                                                                                                                                                                                                                                                                                                            Industry Other
                                                                                                                                                                                                                      Entertainment




                                                                                                                                                                                                                                                                                       Development
                                        Government




                                                                   Technology /




                                                                                                                                          Distribution
                                                      Healthcare




                                                                                                                                                                                                         Marketing




                                                                                                                                                                                                                                                Publishing
                                                                                                                 Education



                                                                                                                              Insurance




                                                                                                                                                                                                                                       Media
Industry

Determining Project Requirements       55.9%         47.1%         32.1%                  26.7%                 33.3%        55.6%        16.7%                  33.3%            37.5%                 25.0%        50.0%            75.0%    100.0%        35.7%                     25.0%               47.4%

Methods for integrating with
                                       58.8%         64.7%         56.6%                  26.7%                 41.7%        50.0%        33.3%                  73.3%            62.5%                 100.0%       100.0%           75.0%    50.0%         39.3%                     20.0%               47.7%
existing structured data

Natural language
                                       11.8%         29.4%         17.0%                  13.3%                 20.8%        27.8%        33.3%                  13.3%            25.0%                  0.0%        25.0%            25.0%    50.0%         32.1%                     40.0%               18.4%
understanding

Entity and/or concept extraction and
                                       29.4%         23.5%         35.8%                  13.3%                 20.8%        22.2%        33.3%                  0.0%             31.3%                  0.0%        50.0%            25.0%    50.0%         25.0%                     30.0%               26.3%
analysis

Filtering / summarizing relevant
                                       23.5%         23.5%         35.8%                  40.0%                 33.3%        27.8%        33.3%                  60.0%            25.0%                 50.0%        25.0%            0.0%     50.0%         21.4%                     40.0%               34.2%
information

Semantic consistency across multiple
                                       47.1%         23.5%         41.5%                  33.3%                 37.5%        44.4%        0.0%                   26.7%            37.5%                  0.0%        75.0%            75.0%    50.0%         42.9%                     40.0%               36.8%
systems

High storage requirements              17.6%         17.6%         28.3%                  26.7%                 25.0%        33.3%        66.7%                  26.7%            25.0%                 25.0%        25.0%            25.0%    50.0%         10.7%                     35.0%               28.9%

Poor system performance
                                       11.8%         23.5%         26.4%                  20.0%                 16.7%        22.2%        16.7%                  20.0%            25.0%                  0.0%        25.0%            0.0%     100.0%        21.4%                     20.0%               15.8%
at large scale

System modeling and design             17.6%         29.4%         18.9%                  13.3%                 20.8%        33.3%        16.7%                  40.0%            37.5%                 25.0%        75.0%            25.0%    50.0%         14.3%                     30.0%               21.1%

Digital Rights management              20.6%         5.9%          9.4%                   6.7%                  12.5%        5.6%         0.0%                   6.7%             0.0%                   0.0%        25.0%            0.0%     50.0%         7.1%                      5.0%                5.3%

Taxonomy development                   35.3%         23.5%         24.5%                  0.0%                  25.0%        11.1%        0.0%                   40.0%            37.5%                  0.0%        50.0%            50.0%    50.0%         25.0%                     25.0%               23.7%

Integrating search results across
                                       23.5%         23.5%         26.4%                  20.0%                 16.7%        27.8%        0.0%                   26.7%            56.3%                 25.0%        75.0%            0.0%     50.0%         21.4%                     25.0%               47.4%
multiple platforms

Security                               32.4%         35.3%         17.0%                  53.3%                 37.5%        16.7%        16.7%                  26.7%            31.3%                  0.0%        25.0%            50.0%    50.0%         28.6%                     15.0%               28.9%

Information Quality problems           32.4%         17.6%         18.9%                  40.0%                 20.8%        44.4%        0.0%                   20.0%            31.3%                 25.0%         0.0%            25.0%    50.0%         25.0%                     45.0%               42.1%




                                                                                                                                                                                                                                      dataversity.net                                                      4
Unstructured Data Formats
Text Documents
Various permutations of text (word processing files, simple text files, emails etc.) make up the largest amount of unstructured data cur-
rently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful informa-
tion from the immensity of corporate email.

Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstruc-
tured text documents. Most of these systems leverage metadata to provide an extra layer of classification allowing for easier searches
and enhanced reporting.


Web Pages
Web pages are unique in the world of unstructured data. In fact, an argument can be made that HTML markup itself provides a mea-
sure of structure. Web sites that are primarily data-driven might even use a fully normalized database as their back end. In addition to
pure HTML markup, many of these Web-based applications are written in Java, PHP, .NET and other development frameworks that
render HTML output.

The proprietary algorithms utilized by search engine providers use HTML meta tags in addition to in-text keyword analysis to help
tailor search results for the user. Of course, some of that same logic also works in generating Internet advertising for Web surfers, with
the ad revenue leading to financial success for the advertising providers.

The rapid growth of social media and the interactive Web is also creating large amounts of unstructured data as well as opportunities
for those companies able to manage and derive value from that data. The rising popularity of graph databases optimized for finding
relationships between social media users and their consumer preferences (along with others in their social networks) is arguably due to
companies hoping to leverage revenue from unstructured data analysis.



Media Formats (Audio, Video, Images)
Audio, video, and image files are all forms of unstructured data. Intelligent real-time analysis of audio data is commonplace in digital
audio recording and processing, and also used to a lesser extent with the other two formats.

Metadata is a must with media files; it provides necessary additional classification. A trip around the iTunes music store provides an
excellent opportunity to see this metadata in action, as band names, genres, and related artists drive Apple’s Genius and other music
recommendation services like the Pandora Internet radio station.



Office Software Data Formats (PowerPoint, MS Project)
Files created using Microsoft Office or any other office suite run the gamut of data formats. Microsoft Access creates and manages
fully structured database files in its own format. Excel and PowerPoint files both provide challenges to organizations looking to
include information from these formats in their corporate reporting mechanisms. Microsoft’s VBA language provides a measure of
functionality in parsing meaningful information from Office files.

Commercial applications with proprietary data formats include various customer management (CRM) tools, larger enterprise resource
planning (ERP) applications like SAP, or even architectural drafting applications like AutoCAD. In some cases, this software lever-
ages a semi-structured data format such as XML for data exchange between applications.




                                                                                                         dataversity.net             5
                                                                                                                                     1
Capturing and Managing
Unstructured Data
Before any enterprise can derive meaningful information from the mass of unstructured data, the process to capture that data needs
consideration. Data scraping or text parsing involves the extraction of information from data at its most basic level, but other methods
and tools provide a more measured approach.


Data Scraping and Text Parsing
Data Scraping is a technique where human-readable information is extracted from a computer system by another program. It is impor-
tant to distinguish the “human readable” aspect of data scraping from the typical data exchange between computers which can involve
structured formats.

The technique first came into vogue with screen scraping, used during the advent of client-server computing, as many organizations
struggled with the volume of data residing in legacy mainframe applications. The data from these mainframe apps, parsed from termi-
nal screens, was usually imported into some form of reporting or Business Intelligence application. Data Scraping can also be neces-
sary when attempting to interface to any legacy system without an application programming interface (API).

In recent times, Web scraping has followed a similar technique as screen scraping, allowing meaningful data from Web pages to be
parsed, scrubbed, and stored in a relational database. There are currently many applications using some form of Web scraping, espe-
cially in the areas of Web mashup and Web site integration, although in many cases APIs provide a more structured process.

Report Mining is similar to data and Web scraping in that it involves deriving meaningful information from a collection of static,
human-readable reports. This technique can also be useful in an enterprise’s software development QA process, such as facilitating
regression test result analysis.




   Data Scraping and Text Parsing Case Study



A Fledgling Financial Services Company Leverages
Technology; Plays with the Big Boys
In the mid-1990s, equipped with investment dollars, ARM Financial Group went on a buying binge, acquiring a collection of small
insurance companies that specialized in retirement products, such as annuities and structured settlements. Its short-term quest focused
on rapidly increasing the dollar value of its assets under management and then profiting from improved efficiencies around the admin-
istration of these assets.

Getting accounting systems under control is normally the first step when acquiring any company in the financial sector. This is usually
followed by bringing the policy administration systems online with the new company. ARM faced a dilemma concerning the older
mainframe systems from a newly acquired firm.

The company made the choice to go client-server for their annuity processing systems; they were one of the first firms in the insurance
sector to do so. While massive projects grew around converting policy records from the old mainframe system to the new client-serv-
er based system, a more elegant solution was quickly developed to handle the customer service role.




                                                                                                        dataversity.net            6
                                                                                                                                   1
Screen scraping was used to grab policy data from mainframe terminals and store the data in a simple relational database, providing
customer service agents with a means to access policy data when on the phone with policyholders. This screen scraping project was
online months before all the policy data were converted and stored into the new client-server system’s database.

While the policy data in the older mainframe system was in a structured format, the use of screen scraping allowed the rapid develop-
ment of a needed solution for the customer service function at the new company.

As ARM continued to grow, improving operational efficiency became vital in maximizing profits. Despite the company’s state-of-the-
art, Web-based system for new annuities, nearly all business was in the form of paper -- both for new policies and changes to existing
policies.

Implementing an imaging and workflow system was crucial in re-engineering processes for controlling all aspects of an annuity’s
policy lifecycle. With most imaging systems, automated text parsing is vital in grabbing information from a document and storing
it in some form of database. The workflow elements of the new imaging system at the financial services company were also highly
dependent on the manual scraping of important metadata from essentially unstructured paper policy documents.

This metadata was essential when routing a policy through the processing workflows at the company. The imaging and workflow sys-
tem also used a SQL Server database to persist important data captured from the policy. While some of the policy data was the same
as what was stored in the annuity administration system, having it also stored in the workflow system with a SQL Server back end
allowed the rapid development of applications in Visual Basic to support customer service and policy management functions.

These new systems proved to be an early form of Business Intelligence application made possible by the leveraging of both manual
and automated screen scraping and text parsing to derive useful information from a mass of unstructured paper documents.

In the late 1990s the concepts around what is now called unstructured data were still in their infancy. The technological successes of a
small financial services company in pioneering insurance processing allowed it to compete with much larger firms. These successes
had a lot to do with recognizing what concepts like screen scraping and text parsing could do to make annuity processing more ef-
ficient by allowing the rapid development of new systems to support both customer service and management roles.


Metadata
Sometimes defined as “data about data,” metadata is often a useful tool when dealing with unstructured data residing in text docu-
ments. In some cases, metadata serves a similar role as a data dictionary by describing the content of data, while other times it is used
in more of a markup or tagging purpose.

While some metadata is embedded directly in the document it describes, metadata can also be stored in a database or some other re-
pository. Many enterprises build a formal metadata registry with limited write access. This registry can be shared internally or exter-
nally through a Web service interface.

Many metadata standards specific to various industries have developed over time. One example of this is the Dublin Core Metadata
Initiative (DCMI), used extensively in library sciences and other disciplines for the purposes of online resource discovery.

Metadata schema syntax can be in a variety of text or markup formats, including HTML, XML, or even plain text. Both ANSI and ISO
are active in developing and enforcing standards for expressing metadata syntax.

A type of metadata, meta tags are used in HTML to mark a Web page with words describing the content of the page. In the past, search
engines relied mostly on meta tags when building result sets based on a search query, although their relevance has waned in recent
times.




                                                                                                       dataversity.net
                                                                                                        dataversity.net              7
                                                                                                                                     1
METADATA CASE STUDY


Leveraging Metadata to Improve Unstructured
Document Searching
The Education Department of a large Midwestern state faced a problem: the state’s teachers needed a system to assist them in organiz-
ing appropriate instructional materials for the classroom. The materials primarily resided as documents in the Education Department’s
Documentum system. The system had a small amount of metadata in it, but there was no functional tool that allowed the easy search-
ing and collecting of that content.

The acquisition of a Google Search Appliance device promised to provide some measure of search functionality, but Google’s inter-
face is predicated on one single text box allowing for full-text search without the ability to narrow or filter those search results. The
teachers needed the ability to search by criteria such as grade level, subject area, or the type of content (lesson plans, content stan-
dards, etc.).

Developing a solution for the state relied on a multi-tier approach that first built a more robust Web search interface that added catego-
rized collections of check boxes for the narrowing of search results, in addition to Google’s standard text box for full-text searching.

A school backpack metaphor was used as the “shopping cart” for this Web-based application. As a teacher searched through the
instructional materials, anything they wanted to use in the classroom was simply saved in their backpack for later retrieval. Different
backpacks could be saved for each teacher depending on their needs.

This front-end search interface and persistent backpack model would not function without improving the metadata stored on each
document in Documentum. A side project involved retrofitting each instructional material document with the relevant metadata for
grade level, subject, and content type.

Luckily, Google’s Search Appliance API allowed additional data to be passed into the search request. Testing revealed the combination
of full-text search with additional filtering provided by the Web interface and enhanced metadata returned the relevant instructional
materials in the search result set.

While Microsoft’s IIS and Active Server Pages were the preferred Web development solution at the Education Department, they were
a few years behind in fully implementing the .NET framework. Because of that, the project became an interesting hybrid of Classic
ASP at the front end with .NET code used to provide the backpack storage functionality, along with the Web service interface for the
Google Search Appliance.

Considering the additional search criteria, it was determined the standard Web page output of the Google Search Appliance was not
sufficient to serve the needs of the application. The appliance’s search result set was available in XML format, so back-end code was
written to enhance the output with graphics, a detailing of the search criteria, and a link to add the specific instructional material to
the teacher’s backpack. This additional output combined nicely with the standard document summary normally provided by Google’s
search.
Even though a variety of pieces went into creating the best solution for teachers, leveraging improved metadata supplied the project’s
ultimate success. Straight out of the box, the Google Search Appliance did not provide the necessary search result filtering needed to
return the correct instructional material from a mass of unstructured documents. By enhancing the metadata stored in Documentum,
Google’s search functionality greatly improved, and the rest of the project was able to proceed.




                                                                                                          dataversity.net             8
                                                                                                                                      1
Taxonomy
While the term taxonomy is traditionally related to the world of biological species classification, it also plays a similar role in classify-
ing terms for any number of subjects. Information Management taxonomies play a vital role in making sense out of unstructured data,
primarily as a method for organizing metadata.

Taxonomies in IT usually take one of two forms. The first form draws from the species classification origin of the term taxonomy, fol-
lowing a hierarchical tree structure model. Individual terms in this kind of taxonomy have “parents” at the higher levels and “chil-
dren” at corresponding lower levels.

The second taxonomy form is essentially a controlled vocabulary of the terms surrounding any subject matter or system. This might
take the form of a simple glossary or thesaurus, or something more complex and resource intensive, like the creation of a fully-formed
ontology. This second type of taxonomy tends to be more common in the world of information technology.

The ANSI/NISO Z39.19 standard exists for the authoring of taxonomies, information thesauruses, and other organized data dictionar-
ies, illustrating the growing maturity of this data management discipline. Over the last decade, new companies and software packages
centered on taxonomy management have come and gone, with a few earning acclaim as industry leaders, sometimes with taxonomies
dealing with a specific subject matter.

Ultimately, taxonomies, controlled vocabularies, and their similar brethren serve an enterprise by organizing metadata in a fashion that
helps in finding valuable business information out of unstructured data.




   taxonomy CASE STUDY #1


Improving Corporate Intranet Search through the
Development and Deployment of Taxonomies
In the middle of the last decade, a large publically-held electronics company had a problem with unstructured information, estimated
to be about 85 percent of all corporate data. In addition, over 90 percent of the corporate data contained no tagging. Issues with the
duplication of content and determining the true age of documents also hampered the company’s associates when searching for infor-
mation.

To get a clearer picture of the scope of the problem, this company surveyed its employees on their Intranet search habits. The respons-
es demonstrated that the employees wanted a better search interface with more categorization and sorting options, along with a more
streamlined search result set. Some respondents felt frustrated with the difficulty in finding corporate documents through the current
search interface.

A corporate project team was formed with the responsibility of improving Intranet search. The core of this project was the develop-
ment of various taxonomies and controlled vocabularies combined with metadata to improve the categorization of the internal unstruc-
tured information; it was meant to provide benefits for the search interface and in the search results.

Five taxonomies were developed and deployed in the first year of the project. Some were purchased externally and modified to fit the
data at this corporation, while the others were fully developed from within. In all cases, certain employees were tasked as Subject Mat-
ter Experts in their relevant areas for the purposes of creating the most suitable vocabularies and metadata for the taxonomies.

The second year of the project saw the deployment of three additional taxonomies covering the areas of Human Resource, Six Sigma,
and Legal. Two taxonomies were purchased externally and modified to suit, while the other was developed internally. Additionally,
some improvement of the original categorization was done based on feedback after the previous year’s deployments.




                                                                                                           dataversity.net              9
                                                                                                                                        1
The benefits of the taxonomy development were obvious; they significantly improved all aspects of Intranet search: general employees
and electronic engineers wasted less time weeding through bad search results, positively improving productivity, and the company’s
bottom line. With the reduction of duplicated data, storage costs were improved, even when considering new data growth.

Post project surveys and metrics revealed increased use of the improved search and the use of the categories. The general employee
opinions on search improved. Ultimately, the internal development team won an award for their work on the project.




   taxonomy CASE STUDY #2


Folksonomies:
A Hybrid Approach to Taxonomy Development

The creation of taxonomies in any organization comes with associated costs, especially in the case of formal taxonomies where exter-
nal consultants and Subject Matter Experts are involved with the process. In some cases, informal taxonomies exhibit smaller costs,
but with more risk considering the relative lack of taxonomy creation experience compared with a more formalized process.

A hybrid approach utilizes experts in taxonomy creation combined with a user-centric focus on the knowledge modeling for the
project. It attempts to lessen the costs associated with a formal taxonomy, while still providing an organized development process. The
term “folksonomy” is used to describe this more user-centric approach.

Folksonomies leverage crowd-sourced wisdom on the relevant subject matter at hand, an approach not too different from social book-
marking Web sites like Digg or Reddit, or even a blog community focused around a specific subject.

In these kinds of hybrid taxonomy projects, users are able to submit Web sites and/or tags they would like to see included in the over-
all vocabulary. An expert taxonomy team reviews these submissions for appropriateness and uniqueness. Finally, the submissions end
up persisted in XML format for inclusion in the enterprise search process.

The sharing of these internal social bookmarks and tags enhances the quality of enterprise search and also provides insight to the
perception of an organization’s internally facing content. Users feel they are an important stakeholder in the process, thus improving
company morale.

Implementing a folksonomy normally involves installing some form of tagging tool, usually available as a module for an open-source
CMS like Drupal or WordPress. A quality reporting system is also important in determining the efficacy of the tagging, in addition to
providing metrics on how the internal content is being used.

In many cases, a hybrid approach to taxonomy development hits a proverbial sweet spot in combining ROI with lower costs when
compared to a full-fledged formal taxonomy. For smaller organizations, with a limited budget, it is an approach worthy of consider-
ation.



eDiscovery (Electronic Discovery)
eDiscovery providers bring an automated search focus to leveraging valuable information from an organization’s unstructured data.
They are similar to discovery systems used primarily in the library science and research industries. A separate section covering library
discovery systems follows this current section.

eDiscovery is widely used in the legal industry as a means for finding any evidence or information potentially useful in a case. In fact,




                                                                                                         dataversity.net           10
                                                                                                                                    1
court sanctioned hacking of computer systems is a valid form of eDiscovery. Computer forensics, normally used when trying to find
deleted evidence off of a hard drive or other computer storage, is also related to eDiscovery.

Electronic discovery systems are able to search a variety of unstructured data formats, including text, media (images, audio, and
video), spreadsheets, email, and even entire Web sites. The best eDiscovery applications search everything from in-house server farms
to the full breadth of the Internet in their quest to find valuable evidentiary information. Investigators peruse the discovered informa-
tion in a variety of formats that run from printed paper to computer-based browsing.

When potential litigation is a concern, the protection of corporate emails, instant messaging, and even metadata attached to unstruc-
tured documents becomes crucial for any enterprise. This electronically stored information (ESI) became a focal point in changes to
Federal Law in 2006 and 2007 that required organizations to retain, protect, and manage this kind of data.

Yet, a 2010 study showed that only that while 52 percent of organizations have an ESI policy, only 38 percent have tested the policy,
and 45 percent aren’t even aware whether any testing occurred. Considering the risk of future litigation, firms need to focus on the
complete development, including testing, of a robust ESI management policy.

As the eDiscovery industry matures, larger companies are bringing the discovery function in house as opposed to relying on external
vendor-provided systems; other firms prefer a mix between internal and out-sourced solutions. Whatever their ultimate choice, firms
need to realize the vital importance of well-defined and documented procedures for ESI archives and eDiscovery platforms.



Discovery Systems
The library sciences industry also depends on electronic discovery systems to provide research and other functionality to their con-
sumers. An array of vendors and platforms has grown around this form of discovery, residing generally separate from the eDiscovery
sector in the legal industry. Recently, enterprise-based knowledge management initiatives have embraced the traditionally library-
based discovery process.

While discovery systems for library sciences have previously focused on search products for traditional content (e.g. books and peri-
odicals), in recent times their scope has broadened to include a wider range of material, including video, audio, and subscriptions to
electronic resources. Additionally, content from external providers is now able to be discovered in the search.

These discovery systems depend on the indexing of both document metadata along with the full text to provide a robust set of search
results for the user. Content providers benefit from enhanced exposure as well as being able to control the display and delivery of the
discovered materials. Obviously, cooperation between those creating the content and those creating the discovery system is paramount
to ensure the proper indexing of metadata.

With competing discovery system providers, those also producing content sometimes choose to not index their documents with a
competitor. This occurred when EBSCO removed its content from the Ex Libris Primo Central discovery system just before introduc-
ing its own EBSCO Discovery Service. Other pundits worry organizations that both produce content and a discovery service will skew
any search results to favor their own content. Consumers, including libraries, need to take into account these issues before subscribing
to any discovery system.



Text Analytics
Text analytics is another related discipline useful in deriving meaningful information from unstructured data. The term became widely
used around the year 2000 as a more formalized business-based outgrowth of text mining, a technique in use since the 1980s.

In addition to raw text mining, text analytics uses other more formalized techniques, including natural language processing, to turn un-
structured text into data more suitable for Business Intelligence and other analytical uses. Considering that a majority of unstructured




                                                                                                        dataversity.net            11
                                                                                                                                    1
data resides in a textual format, text analytics remains one of the most important techniques for making sense of unstructured data.

Professor Marti Hearst from the University of California, in her 1999 paper Untangling Text Data Mining, presciently described the
current practice of text analytics in today’s business climate:

“For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to
produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collec-
tions to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent
text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.”

Text analytics makes up the basis of some of the previously mentioned methods used in capturing unstructured data, most notably
discovery systems and eDiscovery. It is also employed by organizations to monitor social media for human resources applications or
personally targeted advertising.

Machine learning and semantic processing are two other sub-disciplines of text analytics at the forefront of innovation. IBM’s Watson
computer, famous for defeating Jeopardy! all-time champion Ken Jennings, is an apt example of the power of semantics at the higher
end of computer science.

In addition to natural language and semantic processing, a typical text analytics application might include other techniques or process-
es, such as named entity recognition, which is useful in finding common place names, stock ticker symbols, and abbreviations. Dis-
ambiguation methods can be applied to provide context when faced with different entities sharing the same name: Apple, the Beatles
record company, compared to Apple, the consumer electronics giant.

Regular expression matching is a straightforward process used in parsing phone numbers along with email and Web addresses from
unstructured text. Conversely, sentiment analysis is a more difficult technique, attempting to derive subjective information or human
opinion from text. Machine learning combined with sophisticated syntactic analysis techniques normally make up the basis of auto-
mated sentiment analysis.

Text analytics remains a wide-ranging disciplineused in both the business and scientific worlds. From leading edge applications like
IBM’s Watson to day-to-day tasks like parsing emails from text, it is an important part of capturing unstructured data.




                                                                                                          dataversity.net            12
                                                                                                                                      1
Written by Christine Connors




Integrating Unstructured Data
and Putting it to Use in your
Real World
Christine Connors, Principal, TriviumRLG LLC
Various permutations of text (word processing files, simple text files, emails etc.) make up the    Christine Connors has extensive
largest amount of unstructured data currently in the enterprise. Many firms are in the process      experience in taxonomy, ontology
of implementing unstructured data management projects to find useful information from the           and metadata design and develop-
immensity of corporate email.                                                                       ment. Prior to forming TriviumRLG
                                                                                                    Ms. Connors was the global direc-
Content Management Systems exist partially to help an enterprise manage and derive informa-         tor, semantic technology solutions
tion from the data contained in unstructured text documents. Most of these systems leverage         for Dow Jones, responsible for
metadata to provide an extra layer of classification allowing for easier searches and enhanced      partnering with business champions
reporting.                                                                                          across Dow Jones to improve digital
                                                                                                    asset management and delivery.
Now that you are capturing and storing your data from unstructured sources, what can you do         In that position, she managed a
with it? Where can you put it to good use? What new categories of applications are best suited      worldwide team responsible for the
to exploit it?                                                                                      development of taxonomies, ontolo-
                                                                                                    gies and metadata that are used to
                                                                                                    add value to Dow Jones news and
Asset Valuation                                                                                     financial information products. Ms.
Your organization has created terabytes of intellectual property - what is its value? Value is      Connors also served as business
a difficult thing to assess. A piece of information stored “just in case” today may or may not      champion for the Synaptica® soft-
become the critical missing piece of the puzzle down the road. That is assuming of course that      ware application, including manag-
                                                                                                    ing a US- based team of software
you can both find and use that information.
                                                                                                    developers, and supported Dow
                                                                                                    Jones consulting practices world-
Applying structure to your data will enable two critical processes:
                                                                                                    wide, which deliver end-to-end
	       1) De-duplication and ‘weeding’ of bad or unnecessary data
                                                                                                    information access solutions based
	       2) Visualizing what remains
                                                                                                    on taxonomies, metadata and
                                                                                                    semantic technologies. Prior to join-
Once that weeding has occurred, the costs of data storage can be estimated based upon hard-         ing Dow Jones Ms. Connors was a
ware and maintenance expenses. These costs are then weighed against the value of the digital        knowledge architect at Intuit, where
assets. Value is determined by the alignment of the type and nature of the data against the         she was responsible for introduc-
organization’s core goals. Is the data used as inputs to product or to measure the profits of the   ing semantic technologies to online
products or services being delivered? How many degrees of separation are there? Does the            content management and search.
data identify opportunities or threats in the marketplace? What are the predicted profits and       And before that, she was a Meta-
potential losses?                                                                                   data Architect at Raytheon Com-
                                                                                                    pany and Cybrarian at CEOExpress
Expertise Location                                                                                  Company. At Raytheon Company
                                                                                                    she oversaw knowledge repre-
Consider all of the resumes your HR department has collected; they are a wealth of unstruc-         sentation and enterprise search,
tured data regarding the talents of your employees. Using text analytics, a profile can be built    delivering large-scale taxonomies,
of each person. The application of structure to such a collection of existing data means that       metadata schema and rules-based
project heads can easily identify potential team members who have the experience needed             classification to improve retrieval of
to tackle a particular challenge. Add to that the HR-approved information collected during          petabytes of internal information via
performance reviews, and the team leader has a better handle on what strategies to employ           a multi- vendor retrieval platform.
managing the team’s efforts.



                                                                                                      dataversity.net              13
                                                                                                                                    1
Business Intelligence
It’s always good to know how exterior forces can impact your organization. Perhaps your best customer has joined the board of a non-
profit organization -- a board on which an executive at your top competitor is already a member. How will that new network connec-
tion affect your relationship with your customer’s organization?
Or perhaps one of your top engineers has been asked by her alma mater to work with a lead professor, his students, and another
alumnus on a project that could benefit the school. Could you lose this top performer to a new venture? Is the research worth investing
in? Would you like to set up an alert, using a discovery system, to keep track of the internal memos and external press regarding the
project?


New Product Development
Your researchers and editors have crafted fabulous publications. System analysis reveals that your subscribers are only using bits and
pieces of this published work, reading a page or two, reading only the abstract, or going right to charts and visualizations. How can
you break out these sub-sections of content with minimal overhead? How can you start quickly by using existing content?

By identifying entities in your content, you can re-use or create new graphs, charts, and even user-filtered data visualizations. The enti-
ties are identified by analyzing the existing content, extracted or tagged, and then indexed for re-use.

Doing this work allows you to publish sub-sets of data within a single publication or across publications. You can use wholly owned,
licensed or a combination of content as contractually permitted. You can integrate your data in multi-media tools or social networking
sites.




Requirements for
Unstructured Data Projects
By Christine Connors, Principal, TriviumRLG LLC

As with any undertaking, requirements are needed for an unstructured data project. It isn’t about simply exposing the contents of the
documents. It is about making that content useful to the systems and people who need to use them. Or as many experts have said in
other applications: making the right content available to the right people at the right time.

There may be documents exposed that you didn’t know were there, that shouldn’t be publicly available, and are available because of
an error somewhere in the applications. Imagine the trouble if your new system indexed an HR spreadsheet with salaries, addresses,
and social security numbers, while being backed up onto a shared drive the user thought was secure?

Consider the content collections that will be part of the program. Do you anticipate any of it having restrictions? If so, then what are
those restrictions? How will authorized users authenticate and gain access? Will you restrict access by entity type? By rules-based
classification? By system access and control policies? These are important things to consider.

Given that you might find documents you weren’t expecting, how will you architect the back end to scale effectively? Will it be easily
repeated on additional clusters? What OS and software will it need to run? Will it fail over? Can it scale to handle the number of users,
documents, and entities predicted for the anticipated life of the hardware?

Once you’ve determined that, how will users interact? What will the front end need to provide? Typically users manage Create - Read
- Update - Delete rights as permissioned within a system. They also search, browse, publish, integrate, migrate, and import to and from




                                                                                                          dataversity.net           14
other systems. What tools are needed to support these actions? Should select users be able to perform administrative tasks via a client
or browser interface? How about the ability to generate reports? What operating systems does this interface need to support?

Once you’ve got your content under control, how are you going to package and publish it? What other applications need to use the
data? What are the interoperability requirements?

The ongoing identification of your unstructured data is critical to avoid undertaking such a project again. One method is via Metadata
Management. What requirements do you have there? What kinds of information remain important to manage, in addition to the meta-
data elements? Will you need taxonomy? Do you need an external tool or is there a module within your current CMS, DMS or portal
solution that will suffice?

There are many questions here, but most of the overall process is not too different than anything led by a competent project manager.
These tasks can be completed in parallel or serially in combination with usability tests, surveys, and focus groups. Taxonomy develop-
ment, if needed, will benefit from the guidance of an expert, as it is not typically a linear process like that of software development.




Summing Up the Opportunities
Created by Unstructured Data
The existence of vast quantities of unstructured data at any organization is not necessarily a problem. In fact, it needs to be considered
as an opportunity for success. The various case studies contained within proved that projects focused on deriving information out of
seemingly unrelated data in many cases allowed firms with a proactive attitude to gain a competitive advantage when compared to
firms who fear or ignore such unstructured data.

This paper provided background on the various types of unstructured data along with a collection of time-honed techniques for captur-
ing and managing that data. The real world case studies provided inspiration in solving potential unstructured data issues. The list of
applications and organizations with tools related to unstructured data are an excellent starting point for researching the wide issues in
this sector of data management.

Additionally, the included survey data revealed that most firms are taking an active approach with unstructured data, so no one should
feel alone when considering their own data issues. It is a fascinating area of expertise that continues to change and evolve; there are
many opportunities for organizational success, personal success, and knowledge growth, with so many transformations occurring all
the time.




                                                                                                         dataversity.net            15
Appendix: Open Source and Commercial
Applications around Unstructured Data
Teradata: Teradata produces enterprise Business Intelligence and Data Warehousing software. Their suite also provides functional-
ity that facilitates data extraction from various unstructured and structured sources into a proprietary relational database. The company
recently added text analytics (from Attensity) to its software suite to better analyze certain types of unstructured data, including text
documents and spreadsheets.

Teradata Corporation
10000 Innovation Drive
Dayton, OH 45342
Phone: (866) 548-8348

Attensity: Attensity specializes in the extraction of meaning from unstructured data through the use of text analytics. They recently
partnered with Data Warehousing provider, Teradata, to add text analysis to the latter’s software suite.

Attensity Group
2465 East Bayshore Road
Suite 300
Palo Alto, CA 94303
Phone: (650) 433-1700

DataStax: DataStax is known as a leader for commercial implementations of the Apache Cassandra database. Last year’s introduc-
tion of DataStax Enterprise combines Cassandra with the company’s OpsCenter product, all running on the Hadoop framework.

DataStax HQ - SF Bay Area
777 Mariners Island Blvd #510
San Mateo, CA 94404
Phone: (650) 389-6000


Taxonomy Providers
Smartlogic: Smartlogic is an enterprise taxonomy provider primarily known for Semaphore, a content intelligence platform prom-
ising improved control of and easier access to an organization’s unstructured data.

Smartlogic US
560 S. Winchester Blvd, Suite 500
San Jose, California, 95128
Phone: (408) 213-9500
Fax: (408) 572-5601

Synaptica: Synaptica is an innovation leader in the areas of enterprise taxonomy and metadata. Their platform integrates with Mi-
crosoft SharePoint, providing a method to store Synaptica’s taxonomy within SharePoint, facilitating unstructured document search.

Synaptica, LLC
11384 Pine Valley Drive
Franktown, CO 80116
Phone: (303) 298-1947




                                                                                                        dataversity.net            16
                                                                                                                                    1
Teragram: Recently acquired by SAS, Teragram is an expert in the world of linguistic search. They help organizations to better
manage unstructured content in a variety of languages allowing an enterprise to further develop its international presence.

SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513-2414
Phone: (919) 677-8000


Content Management Systems
Documentum: Documentum remains one of the largest content management system (CMS) platforms in the industry. The software
facilitates the management of business documents as well as a host of other unstructured data types, including images, audio, and
video. Documentum is now owned by IT services conglomerate, EMC Corporation.

EMC Corporation
176 South Street
Hopkinton, MA 01748
Phone: (866) 438-3622

MarkLogic: MarkLogic makes enterprise software to help organizations manage unstructured data. Their system is based on the use
of XQuery for fast retrieval of documents marked up with metadata in XML format, and thus scales nicely when accessing Big Data
stores.

MarkLogic Corporation Headquarters
999 Skyway Road, Suite 200
San Carlos, CA 94070
Phone: (877) 992-8885

WordPress: WordPress is one of the most popular open source blogging and content management platforms. A robust community
has grown around the platform which leverages the MySQL and PHP open source solutions for database and scripting functionality.

Drupal: Drupal is another open source content management platform, but without the self-blogging focus of WordPress.


Discovery Systems
Verity K2: K2 is an enterprise search platform, or discovery system, used by organizations wanting to provide intelligent searching of
the mass of corporate unstructured data. Verity was acquired by Autonomy, which in turn was recently acquired by HP.

Autonomy US Headquarters
One Market Plaza
Spear Tower, Suite 1900
San Francisco, CA 94105
Phone: (415) 243 9955

Serials Solutions’ Summon Service: The Summon service from Serial Solutions is a Web-scale discovery system used pri-
marily by libraries. Summon provides search functionality on a full range of media, including audio, video, and e-content, in addition
to books.




                                                                                                       dataversity.net           17
                                                                                                                                  1
Serial Solutions North America
501 North 34th Street
Suite 300
Seattle, WA 98103-8645
Phone: (866) SERIALS (737-4257)

EBSCO Discovery Service: EBSCO’s Discovery Service facilitates discovery of an institution’s resources by combining pre-
indexed metadata from both internal and external sources to create a uniquely tailored search solution known for its speed. Although
known for their database and e-book services, EBSCO’s Discovery Service primarily supports research institutions and libraries.

EBSCO Publishing
10 Estes Street
Ipswich, MA 01938
Phone: (800) 653-2726 (USA  Canada)
Fax: (978) 356-6565

OCLC WorldCat Local: WorldCat Local is a library-based discovery system provided by the Online Computer Library Center
(OCLC). The system provides single search box access to over 922 million items from library collections worldwide. OCLC is also
the organization responsible for first developing the Dublin Core Metadata Initiative.

OCLC Headquarters
6565 Kilgour Place
Dublin, OH 43017-3395
Phone: (614) 764-6000
Toll Free: (800) 848-5878 (USA and Canada only)
Fax: (614) 764-6096

Ex Libris Primo Central: The Primo Central Index is the centerpiece of Ex Libris’ Primo Discovery and Delivery discovery
system focused on providing access for the research scholar audience to hundreds of millions of documents. Ex Libris is a leading
provider of automation solutions for the library sciences market.

Ex Libris
1350 E Touhy Avenue, Suite 200 E
Des Plaines, IL 60018
Phone: (847) 296-2200
Fax: (847) 296-5636
Toll Free: (800) 762-6300
	

Metadata
Dublin Core Metadata Initiative: The Dublin Core Metadata Initiative is a metadata collection primarily used by libraries and
education institutions worldwide. It was originally developed by the Online Computer Library Center (OCLC), located in Dublin OH.

Esri ArcCatalog: Esri’s ArcCatalog is a tool within their ArcGIS software suite used for the development and management of
GIS-related metadata. Esri is a worldwide leader in the management of geographic data.

Esri Headquarters
380 New York Street
Redlands, CA 92373-8100
Phone: (909) 793-2853




                                                                                                      dataversity.net           18
                                                                                                                                 1
SchemaLogic MetaPoint: MetaPoint is a metadata tagging and management tool developed by SchemaLogic. Its primary audi-
ence is companies with large investments in the Microsoft-based document tools, Office, and SharePoint. MetaPoint promises to pro-
vide the missing connection between the two software products. SchemaLogic was recently acquired by taxonomy systems provider,
Smartlogic.

Smartlogic US
560 S. Winchester Blvd, Suite 500
San Jose, California, 95128
Phone: (408) 213-9500
Fax: (408) 572-5601



about dataversity
We provide a centralized location for training, online webinars, certification, news and more for information technology (IT) pro-
fessionals, executives and business managers worldwide. Members enjoy access to a deeper archive, leaders within the industry,
knowledge base and discounts off many educational resources including webinars and data management conferences. For questions,
feedback, ideas on future topics, or for more information please visit: http://www.dataversity.net/, or email: info@dataversity.net.




                                                                                                      dataversity.net           19

Mais conteúdo relacionado

Mais procurados

The Path to Data and Analytics Modernization
The Path to Data and Analytics ModernizationThe Path to Data and Analytics Modernization
The Path to Data and Analytics ModernizationAnalytics8
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementDATAVERSITY
 
Get More From Your Data with Splunk AI + ML
Get More From Your Data with Splunk AI + MLGet More From Your Data with Splunk AI + ML
Get More From Your Data with Splunk AI + MLSplunk
 
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...Jochem van Grondelle
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
What Is My Enterprise Data Maturity 2021
What Is My Enterprise Data Maturity 2021What Is My Enterprise Data Maturity 2021
What Is My Enterprise Data Maturity 2021DATAVERSITY
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
Sample - Data Warehouse Requirements
Sample -  Data Warehouse RequirementsSample -  Data Warehouse Requirements
Sample - Data Warehouse RequirementsDavid Walker
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Amazon Web Services
 
Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)Yaman Hajja, Ph.D.
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesDATAVERSITY
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDBMongoDB
 
Data Marketplace and the Role of Data Virtualization
Data Marketplace and the Role of Data VirtualizationData Marketplace and the Role of Data Virtualization
Data Marketplace and the Role of Data VirtualizationDenodo
 
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...Flink Forward
 

Mais procurados (20)

The Path to Data and Analytics Modernization
The Path to Data and Analytics ModernizationThe Path to Data and Analytics Modernization
The Path to Data and Analytics Modernization
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data Management
 
Get More From Your Data with Splunk AI + ML
Get More From Your Data with Splunk AI + MLGet More From Your Data with Splunk AI + ML
Get More From Your Data with Splunk AI + ML
 
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
AI in security
AI in securityAI in security
AI in security
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
What Is My Enterprise Data Maturity 2021
What Is My Enterprise Data Maturity 2021What Is My Enterprise Data Maturity 2021
What Is My Enterprise Data Maturity 2021
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Sample - Data Warehouse Requirements
Sample -  Data Warehouse RequirementsSample -  Data Warehouse Requirements
Sample - Data Warehouse Requirements
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)Big Data & Analytics (Conceptual and Practical Introduction)
Big Data & Analytics (Conceptual and Practical Introduction)
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & Approaches
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDB
 
Data Marketplace and the Role of Data Virtualization
Data Marketplace and the Role of Data VirtualizationData Marketplace and the Role of Data Virtualization
Data Marketplace and the Role of Data Virtualization
 
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...
Flink Forward SF 2017: Erik de Nooij - StreamING models, how ING adds models ...
 

Destaque

Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataDATAVERSITY
 
Unstructured Data in BI
Unstructured Data in BIUnstructured Data in BI
Unstructured Data in BIMonaheng Diaho
 
The Business Case for Robotic Process Automation (RPA)
The Business Case for Robotic Process Automation (RPA)The Business Case for Robotic Process Automation (RPA)
The Business Case for Robotic Process Automation (RPA)Joe Tawfik
 
Introduction to Object Storage Solutions White Paper
Introduction to Object Storage Solutions White PaperIntroduction to Object Storage Solutions White Paper
Introduction to Object Storage Solutions White PaperHitachi Vantara
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringDATAVERSITY
 
SA2: Text Mining from User Generated Content
SA2: Text Mining from User Generated ContentSA2: Text Mining from User Generated Content
SA2: Text Mining from User Generated ContentJohn Breslin
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extractionR A Akerkar
 
Lecture 11 Unstructured Data and the Data Warehouse
Lecture 11 Unstructured Data and the Data WarehouseLecture 11 Unstructured Data and the Data Warehouse
Lecture 11 Unstructured Data and the Data Warehousephanleson
 
RCOMM 2011 - Sentiment Classification with RapidMiner
RCOMM 2011 - Sentiment Classification with RapidMinerRCOMM 2011 - Sentiment Classification with RapidMiner
RCOMM 2011 - Sentiment Classification with RapidMinerbohanairl
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
HfS Webinar Slides: Smart Process Automation in Enterprise Business
HfS Webinar Slides: Smart Process Automation in Enterprise BusinessHfS Webinar Slides: Smart Process Automation in Enterprise Business
HfS Webinar Slides: Smart Process Automation in Enterprise BusinessHfS Research
 
Emotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningEmotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningSakthi Dasans
 
Siri voice controlled personal assistant
Siri  voice controlled personal assistantSiri  voice controlled personal assistant
Siri voice controlled personal assistantharoonrashidlone
 
Virtual personal assistant
Virtual personal assistantVirtual personal assistant
Virtual personal assistantShubham Bhalekar
 
Structured and Unstructured Big Data ebook
Structured and Unstructured Big Data ebookStructured and Unstructured Big Data ebook
Structured and Unstructured Big Data ebookEmcien Corporation
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 

Destaque (20)

Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured Data
 
Unstructured Data in BI
Unstructured Data in BIUnstructured Data in BI
Unstructured Data in BI
 
The Business Case for Robotic Process Automation (RPA)
The Business Case for Robotic Process Automation (RPA)The Business Case for Robotic Process Automation (RPA)
The Business Case for Robotic Process Automation (RPA)
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Introduction to Object Storage Solutions White Paper
Introduction to Object Storage Solutions White PaperIntroduction to Object Storage Solutions White Paper
Introduction to Object Storage Solutions White Paper
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
LMS Focus Group Report
LMS Focus Group ReportLMS Focus Group Report
LMS Focus Group Report
 
SA2: Text Mining from User Generated Content
SA2: Text Mining from User Generated ContentSA2: Text Mining from User Generated Content
SA2: Text Mining from User Generated Content
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Text MIning
Text MIningText MIning
Text MIning
 
Lecture 11 Unstructured Data and the Data Warehouse
Lecture 11 Unstructured Data and the Data WarehouseLecture 11 Unstructured Data and the Data Warehouse
Lecture 11 Unstructured Data and the Data Warehouse
 
RCOMM 2011 - Sentiment Classification with RapidMiner
RCOMM 2011 - Sentiment Classification with RapidMinerRCOMM 2011 - Sentiment Classification with RapidMiner
RCOMM 2011 - Sentiment Classification with RapidMiner
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
How Siri Works
How Siri WorksHow Siri Works
How Siri Works
 
HfS Webinar Slides: Smart Process Automation in Enterprise Business
HfS Webinar Slides: Smart Process Automation in Enterprise BusinessHfS Webinar Slides: Smart Process Automation in Enterprise Business
HfS Webinar Slides: Smart Process Automation in Enterprise Business
 
Emotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningEmotion detection from text using data mining and text mining
Emotion detection from text using data mining and text mining
 
Siri voice controlled personal assistant
Siri  voice controlled personal assistantSiri  voice controlled personal assistant
Siri voice controlled personal assistant
 
Virtual personal assistant
Virtual personal assistantVirtual personal assistant
Virtual personal assistant
 
Structured and Unstructured Big Data ebook
Structured and Unstructured Big Data ebookStructured and Unstructured Big Data ebook
Structured and Unstructured Big Data ebook
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 

Semelhante a Unstructured Data and the Enterprise

Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Papernomanc
 
Data Science & BI Salary & Skills Report
Data Science & BI Salary & Skills ReportData Science & BI Salary & Skills Report
Data Science & BI Salary & Skills ReportPaul Buzby
 
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperBenefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperVasu S
 
Concorde_TechBooklet_6.1.16
Concorde_TechBooklet_6.1.16Concorde_TechBooklet_6.1.16
Concorde_TechBooklet_6.1.16Kelly Knight
 
White Paper: Gigya's Information Security and Data Privacy Practices
White Paper: Gigya's Information Security and Data Privacy PracticesWhite Paper: Gigya's Information Security and Data Privacy Practices
White Paper: Gigya's Information Security and Data Privacy PracticesGigya
 
Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristicsGeorge Ang
 
Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Banking at Ho Chi Minh city
 
Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Banking at Ho Chi Minh city
 
Ad Ch.1 8 (1)
Ad Ch.1 8 (1)Ad Ch.1 8 (1)
Ad Ch.1 8 (1)gopi1985
 
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...7 Development Projects With The 2007 Microsoft Office System And Windows Shar...
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...LiquidHub
 
The Total Book Developing Solutions With EPiServer 4
The Total Book Developing Solutions With EPiServer 4The Total Book Developing Solutions With EPiServer 4
The Total Book Developing Solutions With EPiServer 4Martin Edenström MKSE.com
 
Java script tools guide cs6
Java script tools guide cs6Java script tools guide cs6
Java script tools guide cs6Sadiq Momin
 

Semelhante a Unstructured Data and the Enterprise (20)

Dell Data Migration A Technical White Paper
Dell Data Migration  A Technical White PaperDell Data Migration  A Technical White Paper
Dell Data Migration A Technical White Paper
 
Data Science & BI Salary & Skills Report
Data Science & BI Salary & Skills ReportData Science & BI Salary & Skills Report
Data Science & BI Salary & Skills Report
 
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - WhitepaperBenefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper
 
Manual Civil 3d Ingles
Manual Civil 3d InglesManual Civil 3d Ingles
Manual Civil 3d Ingles
 
Ftk 1.80 manual
Ftk 1.80 manualFtk 1.80 manual
Ftk 1.80 manual
 
Concorde_TechBooklet_6.1.16
Concorde_TechBooklet_6.1.16Concorde_TechBooklet_6.1.16
Concorde_TechBooklet_6.1.16
 
Reqpro user
Reqpro userReqpro user
Reqpro user
 
White Paper: Gigya's Information Security and Data Privacy Practices
White Paper: Gigya's Information Security and Data Privacy PracticesWhite Paper: Gigya's Information Security and Data Privacy Practices
White Paper: Gigya's Information Security and Data Privacy Practices
 
Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
 
Microsoft Office First Look
Microsoft Office First LookMicrosoft Office First Look
Microsoft Office First Look
 
Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672
 
Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672Robust integration with tivoli directory integrator 7.0 redp4672
Robust integration with tivoli directory integrator 7.0 redp4672
 
Ad Ch.1 8 (1)
Ad Ch.1 8 (1)Ad Ch.1 8 (1)
Ad Ch.1 8 (1)
 
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...7 Development Projects With The 2007 Microsoft Office System And Windows Shar...
7 Development Projects With The 2007 Microsoft Office System And Windows Shar...
 
Hfm user
Hfm userHfm user
Hfm user
 
The Total Book Developing Solutions With EPiServer 4
The Total Book Developing Solutions With EPiServer 4The Total Book Developing Solutions With EPiServer 4
The Total Book Developing Solutions With EPiServer 4
 
R data
R dataR data
R data
 
Microsoft Office 2010.pdf
Microsoft Office 2010.pdfMicrosoft Office 2010.pdf
Microsoft Office 2010.pdf
 
Office 2010 e book
Office 2010   e bookOffice 2010   e book
Office 2010 e book
 
Java script tools guide cs6
Java script tools guide cs6Java script tools guide cs6
Java script tools guide cs6
 

Mais de DATAVERSITY

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data LiteracyDATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for YouDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?DATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsDATAVERSITY
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best PracticesDATAVERSITY
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?DATAVERSITY
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesDATAVERSITY
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 

Mais de DATAVERSITY (20)

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and Governance
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data Literacy
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for You
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic Project
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and Forwards
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement Today
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive Advantage
 

Último

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Último (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Unstructured Data and the Enterprise

  • 1. Second Quarter 2012 Headline subhead Various permutations of text (word processing files, simple text files, emails etc.) make up the largest amount of unstructured data cur- rently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful informa- tion from the immensity of corporate email. Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstruc- tured text documents. Most of these systems leverage metadata to provide an extra layer of classification allowing for easier searches and enhanced reporting. Unstructured Data and the Enterprise by Paul Williams Collaborated with Christine Connors ™ dataversity.net 1
  • 2. table of contents Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 The Growing Industry around Unstructured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Current Usage of Unstructured Data in the Enterprise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . Unstructured Data Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Text Documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Web Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 . . Media Formats (Audio, Video, Images) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 . . Office Software Data Formats (PowerPoint, MS Project). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Capturing and Managing Unstructured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Data Scraping and Text Parsing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 eDiscovery (electronic discovery) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Discovery Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 . . Text Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Insert By Christine Connors, Principal, TriviumRLG LLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 . . Integrating Unstructured Data and Putting it to Use in your Real World. . . . . . . . . . . . . . . . . . . . . . . . 13 Requirements for Unstructured Data Projects (Christine Connors, Principal, TriviumRLG LLC) . . . . . . . . . 14 Summing Up the Opportunities Created by Unstructured Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Open Source and Commercial Applications around Unstructured Data. . . . . . . . . . . . . . . . . . . . . . . . 16 Taxonomy Providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Content Management Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Discovery Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 . . Metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 About Dataversity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 dataversity.net 1
  • 3. ™ Executive Summary In its most basic definition, unstructured data simply means any form of data that does not easily fit into a relational model or a set of database tables. Unstructured data exists in a variety of formats: books, audio, video, or even a collection of documents. In fact, some of this data may very well contain a measure of structure, such as chapters within a novel or the markup on a HTML Web page, but not a full data model typical of relational databases. ➺ Anywhere from 40 to 80 percent of an enterprise’s stored data currently resides in an unstructured format. While most unstructured data is in various text formats, other formats include audio, video, Web pages, and office software data. ➺ Firms able to successfully capture and manage their unstructured data hold a competitive advan- tage over firms unable to do the same. Over 90 percent of enterprises are currently planning to manage unstructured data or are already doing it. ➺ Metadata markup, text analytics, data mining, and taxonomy creation are all industry-proven tech- niques used to capture an enterprise’s unstructured data. Included case studies show these techniques in action as part of success- ful problem-solving projects. ➺ Difficulty in integration with enterprise systems along with the immaturity of the currently available unstructured data management tools are two major barriers to the successful management of unstructured data. Many industry pundits claim that 80 to 85 percent of all enterprise data resides in an unstructured format. Some studies counter that statistic, arguing the true percentage is significantly less. DATAVERSITY’s own 2012 survey reflects a percentage below the common- ly stated 80 percent. No matter the actual percentage compared to structured formats, there is little doubt the amount of unstructured data continues to grow. Gartner Research, one organization quoting the 80 percent metric for unstructured data, predicts a nearly 800 percent growth in the amount of enterprise data over the next 5 years. The large majority of this data is expected to be unstructured. This data growth is one factor driving corporate investments in Big Data, Cloud Computing, and NoSQL. dataversity.net 1
  • 4. The Growing Industry around Unstructured Data A large industry has grown around the task of deriving valuable business information out of unstructured data. With parsers scraping information out of pages and pages of What industry text, in addition to full systems built around data taxonomy and discovery, many op- are you in? tions exist for any enterprise trying to make sense of their unstructured information. 0% 5% 10% 15% 20% TECHNOLOGY/COMPUTERS/SOFTWARE Commercially available Content Management Systems (CMS) use metadata to pro- GOVERNMENT vide more accurate searching of an organization’s document library. In fact, metadata OTHER (PLEASE SPECIFY) IT CONSULTING SERVICES remains a very important tool in the handling of any unstructured data. Business Intel- EDUCATION ligence system providers also offer mature solutions around making sense of the vast SOFTWARE/SYSTEM DEVELOPMENT HEALTHCARE arrays of an enterprise’s seemingly unrelated information. TELECOMMUNICATIONS FINANCIAL SERVICES INSURANCE Despite any associated costs, enterprises better able to leverage meaningful Business MANUFACTURING Intelligence from unstructured data gain a competitive advantage over those compa- RETAIL/WHOLESALE/DISTRIBUTION ENTERTAINMENT nies who cannot. MEDIA MARKETING PUBLISHING Once that data is successfully under management, finding the best techniques and ap- plications for leveraging the data remains key in deriving a competitive advantage for Graph #1 any organization. Current Usage of Unstructured number of employees Data in the Enterprise in your company? DATAVERSITY recently sent out a survey to its readership base covering a host of 0% 5% 10% 15% 20% 25% topics related to unstructured data in the enterprise. With nearly 400 respondents, the LESS THAN 10 answers provide a reliable cross section of industry types and organizational sizes. 10-100 100-1,000 The survey results provide a unique insight as to how organizations perceive their 1,000-5,000 5,000-10,000 unstructured data problem, what steps they are taking to derive meaningful business 10,000-50,000 information from that data, what formats the data resides in, and finally some demo- OVER 50,000 graphic information about their job function, organization, and industry. Graph #2 Referencing the organizational demographics of the survey respondents provides a sense of the types and sizes of the companies attempting to manage unstructured data. In short, a wide cross section of organizational sizes and industries are currently man- aging or considering implementing projects to manage unstructured data. See Graphs One and Two below: dataversity.net 2
  • 5. Documents are the unstructured data type most commonly under management or considered for management, followed closely by emails, presentations, and scanned documents. Various media formats (images, audio, and video) and social media chatter are also important. See Graph Three below: What Types of Unstructured Data Does Your Organization Currently Manage, or is Considering for Management? 0% 20% 40% 60% 80% 100% DOCUMENTS EMAIL SCANNED DOCUMENTS OR IMAGES PRESENTATIONS SOCIAL MEDIA CHATTER DRAWING/GRAPHICS VIDEO AUDIO PHOTOGRAPHS GEOSPATIAL IMAGES NONE - WE HAVE NO PLANS OTHER (PLEASE SPECIFY) Graph #3 Most organizations hope to improve business efficiencies and reduce costs at their organization through the successful management of unstructured data. Additional potential benefits from unstructured data management include the improvement of communication, deriving sales leads, meeting compliance requirements, improving data governance, improving enterprise search, parsing and analyz- ing social media channels, and Business Intelligence system integration. The main barriers to improving unstructured data at the enterprise reflect the growing maturity of the practice. Two major barriers to the successful management of unstructured data are difficulty in integration with enterprise systems along with the immaturity of the currently available unstructured data management tools. Cost, lack of executive support, and a lack of unstructured data knowledge among IT staff are also major barriers. See Graph Four on the following page. what are the main barriers to effectively improving the management of unstructured data within your company? 0% 10% 20% 30% 40% 50% DIFFICULTY OF INTEGRATING UNSTRUCTURED DATA WITH OTHER ENTERPRISE SYSTEMS LACK OF MATURE TOOLS/SYSTEMS FOR ENTERPRISE MANAGEMENT REQUIREMENTS ARE NOT WELL DEFINED SHORTAGE OF SKILLS/KNOWLEDGE AMONGST EXISTING IT STAFF COMPLEXITY OF SOLUTIONS HIGH COST OF AVAILABLE TECHNOLOGIES AND SOLUTIONS TOO MUCH TO DO, TOO LITTLE TIME LACK OF EXECUTIVES OR BUSINESS-LEVEL SUPPORT BUSINESS CASE IS NOT UNDERSTOOD, OR NOT SUFFICIENTLY COMPELLING LACK OF AUTOMATED TOOLS LACK OF GOVERNANCE SUPPORT IN TOOLS/SYSTEMS Graph #4 dataversity.net 3
  • 6. System integration with existing structured data appears to be the key technology challenge of the unstructured data management process across most industries (see Graphic Five below). Determining project requirements is another key challenge, especially in the insurance and entertainment industries, as well as improving semantic consistency between multiple systems. Other relevant chal- lenges include security, taxonomy development, scalability, and search result integration across multiple platforms. What are the key technology challenges you have encountered when attempting to build systems or applications for managing Unstructured Data? IT Consulting Services Retail / Wholesale / Computers / Software Telecommunications Software / System Financial Services Manufacturing Industry Other Entertainment Development Government Technology / Distribution Healthcare Marketing Publishing Education Insurance Media Industry Determining Project Requirements 55.9% 47.1% 32.1% 26.7% 33.3% 55.6% 16.7% 33.3% 37.5% 25.0% 50.0% 75.0% 100.0% 35.7% 25.0% 47.4% Methods for integrating with 58.8% 64.7% 56.6% 26.7% 41.7% 50.0% 33.3% 73.3% 62.5% 100.0% 100.0% 75.0% 50.0% 39.3% 20.0% 47.7% existing structured data Natural language 11.8% 29.4% 17.0% 13.3% 20.8% 27.8% 33.3% 13.3% 25.0% 0.0% 25.0% 25.0% 50.0% 32.1% 40.0% 18.4% understanding Entity and/or concept extraction and 29.4% 23.5% 35.8% 13.3% 20.8% 22.2% 33.3% 0.0% 31.3% 0.0% 50.0% 25.0% 50.0% 25.0% 30.0% 26.3% analysis Filtering / summarizing relevant 23.5% 23.5% 35.8% 40.0% 33.3% 27.8% 33.3% 60.0% 25.0% 50.0% 25.0% 0.0% 50.0% 21.4% 40.0% 34.2% information Semantic consistency across multiple 47.1% 23.5% 41.5% 33.3% 37.5% 44.4% 0.0% 26.7% 37.5% 0.0% 75.0% 75.0% 50.0% 42.9% 40.0% 36.8% systems High storage requirements 17.6% 17.6% 28.3% 26.7% 25.0% 33.3% 66.7% 26.7% 25.0% 25.0% 25.0% 25.0% 50.0% 10.7% 35.0% 28.9% Poor system performance 11.8% 23.5% 26.4% 20.0% 16.7% 22.2% 16.7% 20.0% 25.0% 0.0% 25.0% 0.0% 100.0% 21.4% 20.0% 15.8% at large scale System modeling and design 17.6% 29.4% 18.9% 13.3% 20.8% 33.3% 16.7% 40.0% 37.5% 25.0% 75.0% 25.0% 50.0% 14.3% 30.0% 21.1% Digital Rights management 20.6% 5.9% 9.4% 6.7% 12.5% 5.6% 0.0% 6.7% 0.0% 0.0% 25.0% 0.0% 50.0% 7.1% 5.0% 5.3% Taxonomy development 35.3% 23.5% 24.5% 0.0% 25.0% 11.1% 0.0% 40.0% 37.5% 0.0% 50.0% 50.0% 50.0% 25.0% 25.0% 23.7% Integrating search results across 23.5% 23.5% 26.4% 20.0% 16.7% 27.8% 0.0% 26.7% 56.3% 25.0% 75.0% 0.0% 50.0% 21.4% 25.0% 47.4% multiple platforms Security 32.4% 35.3% 17.0% 53.3% 37.5% 16.7% 16.7% 26.7% 31.3% 0.0% 25.0% 50.0% 50.0% 28.6% 15.0% 28.9% Information Quality problems 32.4% 17.6% 18.9% 40.0% 20.8% 44.4% 0.0% 20.0% 31.3% 25.0% 0.0% 25.0% 50.0% 25.0% 45.0% 42.1% dataversity.net 4
  • 7. Unstructured Data Formats Text Documents Various permutations of text (word processing files, simple text files, emails etc.) make up the largest amount of unstructured data cur- rently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful informa- tion from the immensity of corporate email. Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstruc- tured text documents. Most of these systems leverage metadata to provide an extra layer of classification allowing for easier searches and enhanced reporting. Web Pages Web pages are unique in the world of unstructured data. In fact, an argument can be made that HTML markup itself provides a mea- sure of structure. Web sites that are primarily data-driven might even use a fully normalized database as their back end. In addition to pure HTML markup, many of these Web-based applications are written in Java, PHP, .NET and other development frameworks that render HTML output. The proprietary algorithms utilized by search engine providers use HTML meta tags in addition to in-text keyword analysis to help tailor search results for the user. Of course, some of that same logic also works in generating Internet advertising for Web surfers, with the ad revenue leading to financial success for the advertising providers. The rapid growth of social media and the interactive Web is also creating large amounts of unstructured data as well as opportunities for those companies able to manage and derive value from that data. The rising popularity of graph databases optimized for finding relationships between social media users and their consumer preferences (along with others in their social networks) is arguably due to companies hoping to leverage revenue from unstructured data analysis. Media Formats (Audio, Video, Images) Audio, video, and image files are all forms of unstructured data. Intelligent real-time analysis of audio data is commonplace in digital audio recording and processing, and also used to a lesser extent with the other two formats. Metadata is a must with media files; it provides necessary additional classification. A trip around the iTunes music store provides an excellent opportunity to see this metadata in action, as band names, genres, and related artists drive Apple’s Genius and other music recommendation services like the Pandora Internet radio station. Office Software Data Formats (PowerPoint, MS Project) Files created using Microsoft Office or any other office suite run the gamut of data formats. Microsoft Access creates and manages fully structured database files in its own format. Excel and PowerPoint files both provide challenges to organizations looking to include information from these formats in their corporate reporting mechanisms. Microsoft’s VBA language provides a measure of functionality in parsing meaningful information from Office files. Commercial applications with proprietary data formats include various customer management (CRM) tools, larger enterprise resource planning (ERP) applications like SAP, or even architectural drafting applications like AutoCAD. In some cases, this software lever- ages a semi-structured data format such as XML for data exchange between applications. dataversity.net 5 1
  • 8. Capturing and Managing Unstructured Data Before any enterprise can derive meaningful information from the mass of unstructured data, the process to capture that data needs consideration. Data scraping or text parsing involves the extraction of information from data at its most basic level, but other methods and tools provide a more measured approach. Data Scraping and Text Parsing Data Scraping is a technique where human-readable information is extracted from a computer system by another program. It is impor- tant to distinguish the “human readable” aspect of data scraping from the typical data exchange between computers which can involve structured formats. The technique first came into vogue with screen scraping, used during the advent of client-server computing, as many organizations struggled with the volume of data residing in legacy mainframe applications. The data from these mainframe apps, parsed from termi- nal screens, was usually imported into some form of reporting or Business Intelligence application. Data Scraping can also be neces- sary when attempting to interface to any legacy system without an application programming interface (API). In recent times, Web scraping has followed a similar technique as screen scraping, allowing meaningful data from Web pages to be parsed, scrubbed, and stored in a relational database. There are currently many applications using some form of Web scraping, espe- cially in the areas of Web mashup and Web site integration, although in many cases APIs provide a more structured process. Report Mining is similar to data and Web scraping in that it involves deriving meaningful information from a collection of static, human-readable reports. This technique can also be useful in an enterprise’s software development QA process, such as facilitating regression test result analysis. Data Scraping and Text Parsing Case Study A Fledgling Financial Services Company Leverages Technology; Plays with the Big Boys In the mid-1990s, equipped with investment dollars, ARM Financial Group went on a buying binge, acquiring a collection of small insurance companies that specialized in retirement products, such as annuities and structured settlements. Its short-term quest focused on rapidly increasing the dollar value of its assets under management and then profiting from improved efficiencies around the admin- istration of these assets. Getting accounting systems under control is normally the first step when acquiring any company in the financial sector. This is usually followed by bringing the policy administration systems online with the new company. ARM faced a dilemma concerning the older mainframe systems from a newly acquired firm. The company made the choice to go client-server for their annuity processing systems; they were one of the first firms in the insurance sector to do so. While massive projects grew around converting policy records from the old mainframe system to the new client-serv- er based system, a more elegant solution was quickly developed to handle the customer service role. dataversity.net 6 1
  • 9. Screen scraping was used to grab policy data from mainframe terminals and store the data in a simple relational database, providing customer service agents with a means to access policy data when on the phone with policyholders. This screen scraping project was online months before all the policy data were converted and stored into the new client-server system’s database. While the policy data in the older mainframe system was in a structured format, the use of screen scraping allowed the rapid develop- ment of a needed solution for the customer service function at the new company. As ARM continued to grow, improving operational efficiency became vital in maximizing profits. Despite the company’s state-of-the- art, Web-based system for new annuities, nearly all business was in the form of paper -- both for new policies and changes to existing policies. Implementing an imaging and workflow system was crucial in re-engineering processes for controlling all aspects of an annuity’s policy lifecycle. With most imaging systems, automated text parsing is vital in grabbing information from a document and storing it in some form of database. The workflow elements of the new imaging system at the financial services company were also highly dependent on the manual scraping of important metadata from essentially unstructured paper policy documents. This metadata was essential when routing a policy through the processing workflows at the company. The imaging and workflow sys- tem also used a SQL Server database to persist important data captured from the policy. While some of the policy data was the same as what was stored in the annuity administration system, having it also stored in the workflow system with a SQL Server back end allowed the rapid development of applications in Visual Basic to support customer service and policy management functions. These new systems proved to be an early form of Business Intelligence application made possible by the leveraging of both manual and automated screen scraping and text parsing to derive useful information from a mass of unstructured paper documents. In the late 1990s the concepts around what is now called unstructured data were still in their infancy. The technological successes of a small financial services company in pioneering insurance processing allowed it to compete with much larger firms. These successes had a lot to do with recognizing what concepts like screen scraping and text parsing could do to make annuity processing more ef- ficient by allowing the rapid development of new systems to support both customer service and management roles. Metadata Sometimes defined as “data about data,” metadata is often a useful tool when dealing with unstructured data residing in text docu- ments. In some cases, metadata serves a similar role as a data dictionary by describing the content of data, while other times it is used in more of a markup or tagging purpose. While some metadata is embedded directly in the document it describes, metadata can also be stored in a database or some other re- pository. Many enterprises build a formal metadata registry with limited write access. This registry can be shared internally or exter- nally through a Web service interface. Many metadata standards specific to various industries have developed over time. One example of this is the Dublin Core Metadata Initiative (DCMI), used extensively in library sciences and other disciplines for the purposes of online resource discovery. Metadata schema syntax can be in a variety of text or markup formats, including HTML, XML, or even plain text. Both ANSI and ISO are active in developing and enforcing standards for expressing metadata syntax. A type of metadata, meta tags are used in HTML to mark a Web page with words describing the content of the page. In the past, search engines relied mostly on meta tags when building result sets based on a search query, although their relevance has waned in recent times. dataversity.net dataversity.net 7 1
  • 10. METADATA CASE STUDY Leveraging Metadata to Improve Unstructured Document Searching The Education Department of a large Midwestern state faced a problem: the state’s teachers needed a system to assist them in organiz- ing appropriate instructional materials for the classroom. The materials primarily resided as documents in the Education Department’s Documentum system. The system had a small amount of metadata in it, but there was no functional tool that allowed the easy search- ing and collecting of that content. The acquisition of a Google Search Appliance device promised to provide some measure of search functionality, but Google’s inter- face is predicated on one single text box allowing for full-text search without the ability to narrow or filter those search results. The teachers needed the ability to search by criteria such as grade level, subject area, or the type of content (lesson plans, content stan- dards, etc.). Developing a solution for the state relied on a multi-tier approach that first built a more robust Web search interface that added catego- rized collections of check boxes for the narrowing of search results, in addition to Google’s standard text box for full-text searching. A school backpack metaphor was used as the “shopping cart” for this Web-based application. As a teacher searched through the instructional materials, anything they wanted to use in the classroom was simply saved in their backpack for later retrieval. Different backpacks could be saved for each teacher depending on their needs. This front-end search interface and persistent backpack model would not function without improving the metadata stored on each document in Documentum. A side project involved retrofitting each instructional material document with the relevant metadata for grade level, subject, and content type. Luckily, Google’s Search Appliance API allowed additional data to be passed into the search request. Testing revealed the combination of full-text search with additional filtering provided by the Web interface and enhanced metadata returned the relevant instructional materials in the search result set. While Microsoft’s IIS and Active Server Pages were the preferred Web development solution at the Education Department, they were a few years behind in fully implementing the .NET framework. Because of that, the project became an interesting hybrid of Classic ASP at the front end with .NET code used to provide the backpack storage functionality, along with the Web service interface for the Google Search Appliance. Considering the additional search criteria, it was determined the standard Web page output of the Google Search Appliance was not sufficient to serve the needs of the application. The appliance’s search result set was available in XML format, so back-end code was written to enhance the output with graphics, a detailing of the search criteria, and a link to add the specific instructional material to the teacher’s backpack. This additional output combined nicely with the standard document summary normally provided by Google’s search. Even though a variety of pieces went into creating the best solution for teachers, leveraging improved metadata supplied the project’s ultimate success. Straight out of the box, the Google Search Appliance did not provide the necessary search result filtering needed to return the correct instructional material from a mass of unstructured documents. By enhancing the metadata stored in Documentum, Google’s search functionality greatly improved, and the rest of the project was able to proceed. dataversity.net 8 1
  • 11. Taxonomy While the term taxonomy is traditionally related to the world of biological species classification, it also plays a similar role in classify- ing terms for any number of subjects. Information Management taxonomies play a vital role in making sense out of unstructured data, primarily as a method for organizing metadata. Taxonomies in IT usually take one of two forms. The first form draws from the species classification origin of the term taxonomy, fol- lowing a hierarchical tree structure model. Individual terms in this kind of taxonomy have “parents” at the higher levels and “chil- dren” at corresponding lower levels. The second taxonomy form is essentially a controlled vocabulary of the terms surrounding any subject matter or system. This might take the form of a simple glossary or thesaurus, or something more complex and resource intensive, like the creation of a fully-formed ontology. This second type of taxonomy tends to be more common in the world of information technology. The ANSI/NISO Z39.19 standard exists for the authoring of taxonomies, information thesauruses, and other organized data dictionar- ies, illustrating the growing maturity of this data management discipline. Over the last decade, new companies and software packages centered on taxonomy management have come and gone, with a few earning acclaim as industry leaders, sometimes with taxonomies dealing with a specific subject matter. Ultimately, taxonomies, controlled vocabularies, and their similar brethren serve an enterprise by organizing metadata in a fashion that helps in finding valuable business information out of unstructured data. taxonomy CASE STUDY #1 Improving Corporate Intranet Search through the Development and Deployment of Taxonomies In the middle of the last decade, a large publically-held electronics company had a problem with unstructured information, estimated to be about 85 percent of all corporate data. In addition, over 90 percent of the corporate data contained no tagging. Issues with the duplication of content and determining the true age of documents also hampered the company’s associates when searching for infor- mation. To get a clearer picture of the scope of the problem, this company surveyed its employees on their Intranet search habits. The respons- es demonstrated that the employees wanted a better search interface with more categorization and sorting options, along with a more streamlined search result set. Some respondents felt frustrated with the difficulty in finding corporate documents through the current search interface. A corporate project team was formed with the responsibility of improving Intranet search. The core of this project was the develop- ment of various taxonomies and controlled vocabularies combined with metadata to improve the categorization of the internal unstruc- tured information; it was meant to provide benefits for the search interface and in the search results. Five taxonomies were developed and deployed in the first year of the project. Some were purchased externally and modified to fit the data at this corporation, while the others were fully developed from within. In all cases, certain employees were tasked as Subject Mat- ter Experts in their relevant areas for the purposes of creating the most suitable vocabularies and metadata for the taxonomies. The second year of the project saw the deployment of three additional taxonomies covering the areas of Human Resource, Six Sigma, and Legal. Two taxonomies were purchased externally and modified to suit, while the other was developed internally. Additionally, some improvement of the original categorization was done based on feedback after the previous year’s deployments. dataversity.net 9 1
  • 12. The benefits of the taxonomy development were obvious; they significantly improved all aspects of Intranet search: general employees and electronic engineers wasted less time weeding through bad search results, positively improving productivity, and the company’s bottom line. With the reduction of duplicated data, storage costs were improved, even when considering new data growth. Post project surveys and metrics revealed increased use of the improved search and the use of the categories. The general employee opinions on search improved. Ultimately, the internal development team won an award for their work on the project. taxonomy CASE STUDY #2 Folksonomies: A Hybrid Approach to Taxonomy Development The creation of taxonomies in any organization comes with associated costs, especially in the case of formal taxonomies where exter- nal consultants and Subject Matter Experts are involved with the process. In some cases, informal taxonomies exhibit smaller costs, but with more risk considering the relative lack of taxonomy creation experience compared with a more formalized process. A hybrid approach utilizes experts in taxonomy creation combined with a user-centric focus on the knowledge modeling for the project. It attempts to lessen the costs associated with a formal taxonomy, while still providing an organized development process. The term “folksonomy” is used to describe this more user-centric approach. Folksonomies leverage crowd-sourced wisdom on the relevant subject matter at hand, an approach not too different from social book- marking Web sites like Digg or Reddit, or even a blog community focused around a specific subject. In these kinds of hybrid taxonomy projects, users are able to submit Web sites and/or tags they would like to see included in the over- all vocabulary. An expert taxonomy team reviews these submissions for appropriateness and uniqueness. Finally, the submissions end up persisted in XML format for inclusion in the enterprise search process. The sharing of these internal social bookmarks and tags enhances the quality of enterprise search and also provides insight to the perception of an organization’s internally facing content. Users feel they are an important stakeholder in the process, thus improving company morale. Implementing a folksonomy normally involves installing some form of tagging tool, usually available as a module for an open-source CMS like Drupal or WordPress. A quality reporting system is also important in determining the efficacy of the tagging, in addition to providing metrics on how the internal content is being used. In many cases, a hybrid approach to taxonomy development hits a proverbial sweet spot in combining ROI with lower costs when compared to a full-fledged formal taxonomy. For smaller organizations, with a limited budget, it is an approach worthy of consider- ation. eDiscovery (Electronic Discovery) eDiscovery providers bring an automated search focus to leveraging valuable information from an organization’s unstructured data. They are similar to discovery systems used primarily in the library science and research industries. A separate section covering library discovery systems follows this current section. eDiscovery is widely used in the legal industry as a means for finding any evidence or information potentially useful in a case. In fact, dataversity.net 10 1
  • 13. court sanctioned hacking of computer systems is a valid form of eDiscovery. Computer forensics, normally used when trying to find deleted evidence off of a hard drive or other computer storage, is also related to eDiscovery. Electronic discovery systems are able to search a variety of unstructured data formats, including text, media (images, audio, and video), spreadsheets, email, and even entire Web sites. The best eDiscovery applications search everything from in-house server farms to the full breadth of the Internet in their quest to find valuable evidentiary information. Investigators peruse the discovered informa- tion in a variety of formats that run from printed paper to computer-based browsing. When potential litigation is a concern, the protection of corporate emails, instant messaging, and even metadata attached to unstruc- tured documents becomes crucial for any enterprise. This electronically stored information (ESI) became a focal point in changes to Federal Law in 2006 and 2007 that required organizations to retain, protect, and manage this kind of data. Yet, a 2010 study showed that only that while 52 percent of organizations have an ESI policy, only 38 percent have tested the policy, and 45 percent aren’t even aware whether any testing occurred. Considering the risk of future litigation, firms need to focus on the complete development, including testing, of a robust ESI management policy. As the eDiscovery industry matures, larger companies are bringing the discovery function in house as opposed to relying on external vendor-provided systems; other firms prefer a mix between internal and out-sourced solutions. Whatever their ultimate choice, firms need to realize the vital importance of well-defined and documented procedures for ESI archives and eDiscovery platforms. Discovery Systems The library sciences industry also depends on electronic discovery systems to provide research and other functionality to their con- sumers. An array of vendors and platforms has grown around this form of discovery, residing generally separate from the eDiscovery sector in the legal industry. Recently, enterprise-based knowledge management initiatives have embraced the traditionally library- based discovery process. While discovery systems for library sciences have previously focused on search products for traditional content (e.g. books and peri- odicals), in recent times their scope has broadened to include a wider range of material, including video, audio, and subscriptions to electronic resources. Additionally, content from external providers is now able to be discovered in the search. These discovery systems depend on the indexing of both document metadata along with the full text to provide a robust set of search results for the user. Content providers benefit from enhanced exposure as well as being able to control the display and delivery of the discovered materials. Obviously, cooperation between those creating the content and those creating the discovery system is paramount to ensure the proper indexing of metadata. With competing discovery system providers, those also producing content sometimes choose to not index their documents with a competitor. This occurred when EBSCO removed its content from the Ex Libris Primo Central discovery system just before introduc- ing its own EBSCO Discovery Service. Other pundits worry organizations that both produce content and a discovery service will skew any search results to favor their own content. Consumers, including libraries, need to take into account these issues before subscribing to any discovery system. Text Analytics Text analytics is another related discipline useful in deriving meaningful information from unstructured data. The term became widely used around the year 2000 as a more formalized business-based outgrowth of text mining, a technique in use since the 1980s. In addition to raw text mining, text analytics uses other more formalized techniques, including natural language processing, to turn un- structured text into data more suitable for Business Intelligence and other analytical uses. Considering that a majority of unstructured dataversity.net 11 1
  • 14. data resides in a textual format, text analytics remains one of the most important techniques for making sense of unstructured data. Professor Marti Hearst from the University of California, in her 1999 paper Untangling Text Data Mining, presciently described the current practice of text analytics in today’s business climate: “For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collec- tions to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.” Text analytics makes up the basis of some of the previously mentioned methods used in capturing unstructured data, most notably discovery systems and eDiscovery. It is also employed by organizations to monitor social media for human resources applications or personally targeted advertising. Machine learning and semantic processing are two other sub-disciplines of text analytics at the forefront of innovation. IBM’s Watson computer, famous for defeating Jeopardy! all-time champion Ken Jennings, is an apt example of the power of semantics at the higher end of computer science. In addition to natural language and semantic processing, a typical text analytics application might include other techniques or process- es, such as named entity recognition, which is useful in finding common place names, stock ticker symbols, and abbreviations. Dis- ambiguation methods can be applied to provide context when faced with different entities sharing the same name: Apple, the Beatles record company, compared to Apple, the consumer electronics giant. Regular expression matching is a straightforward process used in parsing phone numbers along with email and Web addresses from unstructured text. Conversely, sentiment analysis is a more difficult technique, attempting to derive subjective information or human opinion from text. Machine learning combined with sophisticated syntactic analysis techniques normally make up the basis of auto- mated sentiment analysis. Text analytics remains a wide-ranging disciplineused in both the business and scientific worlds. From leading edge applications like IBM’s Watson to day-to-day tasks like parsing emails from text, it is an important part of capturing unstructured data. dataversity.net 12 1
  • 15. Written by Christine Connors Integrating Unstructured Data and Putting it to Use in your Real World Christine Connors, Principal, TriviumRLG LLC Various permutations of text (word processing files, simple text files, emails etc.) make up the Christine Connors has extensive largest amount of unstructured data currently in the enterprise. Many firms are in the process experience in taxonomy, ontology of implementing unstructured data management projects to find useful information from the and metadata design and develop- immensity of corporate email. ment. Prior to forming TriviumRLG Ms. Connors was the global direc- Content Management Systems exist partially to help an enterprise manage and derive informa- tor, semantic technology solutions tion from the data contained in unstructured text documents. Most of these systems leverage for Dow Jones, responsible for metadata to provide an extra layer of classification allowing for easier searches and enhanced partnering with business champions reporting. across Dow Jones to improve digital asset management and delivery. Now that you are capturing and storing your data from unstructured sources, what can you do In that position, she managed a with it? Where can you put it to good use? What new categories of applications are best suited worldwide team responsible for the to exploit it? development of taxonomies, ontolo- gies and metadata that are used to add value to Dow Jones news and Asset Valuation financial information products. Ms. Your organization has created terabytes of intellectual property - what is its value? Value is Connors also served as business a difficult thing to assess. A piece of information stored “just in case” today may or may not champion for the Synaptica® soft- become the critical missing piece of the puzzle down the road. That is assuming of course that ware application, including manag- ing a US- based team of software you can both find and use that information. developers, and supported Dow Jones consulting practices world- Applying structure to your data will enable two critical processes: wide, which deliver end-to-end 1) De-duplication and ‘weeding’ of bad or unnecessary data information access solutions based 2) Visualizing what remains on taxonomies, metadata and semantic technologies. Prior to join- Once that weeding has occurred, the costs of data storage can be estimated based upon hard- ing Dow Jones Ms. Connors was a ware and maintenance expenses. These costs are then weighed against the value of the digital knowledge architect at Intuit, where assets. Value is determined by the alignment of the type and nature of the data against the she was responsible for introduc- organization’s core goals. Is the data used as inputs to product or to measure the profits of the ing semantic technologies to online products or services being delivered? How many degrees of separation are there? Does the content management and search. data identify opportunities or threats in the marketplace? What are the predicted profits and And before that, she was a Meta- potential losses? data Architect at Raytheon Com- pany and Cybrarian at CEOExpress Expertise Location Company. At Raytheon Company she oversaw knowledge repre- Consider all of the resumes your HR department has collected; they are a wealth of unstruc- sentation and enterprise search, tured data regarding the talents of your employees. Using text analytics, a profile can be built delivering large-scale taxonomies, of each person. The application of structure to such a collection of existing data means that metadata schema and rules-based project heads can easily identify potential team members who have the experience needed classification to improve retrieval of to tackle a particular challenge. Add to that the HR-approved information collected during petabytes of internal information via performance reviews, and the team leader has a better handle on what strategies to employ a multi- vendor retrieval platform. managing the team’s efforts. dataversity.net 13 1
  • 16. Business Intelligence It’s always good to know how exterior forces can impact your organization. Perhaps your best customer has joined the board of a non- profit organization -- a board on which an executive at your top competitor is already a member. How will that new network connec- tion affect your relationship with your customer’s organization? Or perhaps one of your top engineers has been asked by her alma mater to work with a lead professor, his students, and another alumnus on a project that could benefit the school. Could you lose this top performer to a new venture? Is the research worth investing in? Would you like to set up an alert, using a discovery system, to keep track of the internal memos and external press regarding the project? New Product Development Your researchers and editors have crafted fabulous publications. System analysis reveals that your subscribers are only using bits and pieces of this published work, reading a page or two, reading only the abstract, or going right to charts and visualizations. How can you break out these sub-sections of content with minimal overhead? How can you start quickly by using existing content? By identifying entities in your content, you can re-use or create new graphs, charts, and even user-filtered data visualizations. The enti- ties are identified by analyzing the existing content, extracted or tagged, and then indexed for re-use. Doing this work allows you to publish sub-sets of data within a single publication or across publications. You can use wholly owned, licensed or a combination of content as contractually permitted. You can integrate your data in multi-media tools or social networking sites. Requirements for Unstructured Data Projects By Christine Connors, Principal, TriviumRLG LLC As with any undertaking, requirements are needed for an unstructured data project. It isn’t about simply exposing the contents of the documents. It is about making that content useful to the systems and people who need to use them. Or as many experts have said in other applications: making the right content available to the right people at the right time. There may be documents exposed that you didn’t know were there, that shouldn’t be publicly available, and are available because of an error somewhere in the applications. Imagine the trouble if your new system indexed an HR spreadsheet with salaries, addresses, and social security numbers, while being backed up onto a shared drive the user thought was secure? Consider the content collections that will be part of the program. Do you anticipate any of it having restrictions? If so, then what are those restrictions? How will authorized users authenticate and gain access? Will you restrict access by entity type? By rules-based classification? By system access and control policies? These are important things to consider. Given that you might find documents you weren’t expecting, how will you architect the back end to scale effectively? Will it be easily repeated on additional clusters? What OS and software will it need to run? Will it fail over? Can it scale to handle the number of users, documents, and entities predicted for the anticipated life of the hardware? Once you’ve determined that, how will users interact? What will the front end need to provide? Typically users manage Create - Read - Update - Delete rights as permissioned within a system. They also search, browse, publish, integrate, migrate, and import to and from dataversity.net 14
  • 17. other systems. What tools are needed to support these actions? Should select users be able to perform administrative tasks via a client or browser interface? How about the ability to generate reports? What operating systems does this interface need to support? Once you’ve got your content under control, how are you going to package and publish it? What other applications need to use the data? What are the interoperability requirements? The ongoing identification of your unstructured data is critical to avoid undertaking such a project again. One method is via Metadata Management. What requirements do you have there? What kinds of information remain important to manage, in addition to the meta- data elements? Will you need taxonomy? Do you need an external tool or is there a module within your current CMS, DMS or portal solution that will suffice? There are many questions here, but most of the overall process is not too different than anything led by a competent project manager. These tasks can be completed in parallel or serially in combination with usability tests, surveys, and focus groups. Taxonomy develop- ment, if needed, will benefit from the guidance of an expert, as it is not typically a linear process like that of software development. Summing Up the Opportunities Created by Unstructured Data The existence of vast quantities of unstructured data at any organization is not necessarily a problem. In fact, it needs to be considered as an opportunity for success. The various case studies contained within proved that projects focused on deriving information out of seemingly unrelated data in many cases allowed firms with a proactive attitude to gain a competitive advantage when compared to firms who fear or ignore such unstructured data. This paper provided background on the various types of unstructured data along with a collection of time-honed techniques for captur- ing and managing that data. The real world case studies provided inspiration in solving potential unstructured data issues. The list of applications and organizations with tools related to unstructured data are an excellent starting point for researching the wide issues in this sector of data management. Additionally, the included survey data revealed that most firms are taking an active approach with unstructured data, so no one should feel alone when considering their own data issues. It is a fascinating area of expertise that continues to change and evolve; there are many opportunities for organizational success, personal success, and knowledge growth, with so many transformations occurring all the time. dataversity.net 15
  • 18. Appendix: Open Source and Commercial Applications around Unstructured Data Teradata: Teradata produces enterprise Business Intelligence and Data Warehousing software. Their suite also provides functional- ity that facilitates data extraction from various unstructured and structured sources into a proprietary relational database. The company recently added text analytics (from Attensity) to its software suite to better analyze certain types of unstructured data, including text documents and spreadsheets. Teradata Corporation 10000 Innovation Drive Dayton, OH 45342 Phone: (866) 548-8348 Attensity: Attensity specializes in the extraction of meaning from unstructured data through the use of text analytics. They recently partnered with Data Warehousing provider, Teradata, to add text analysis to the latter’s software suite. Attensity Group 2465 East Bayshore Road Suite 300 Palo Alto, CA 94303 Phone: (650) 433-1700 DataStax: DataStax is known as a leader for commercial implementations of the Apache Cassandra database. Last year’s introduc- tion of DataStax Enterprise combines Cassandra with the company’s OpsCenter product, all running on the Hadoop framework. DataStax HQ - SF Bay Area 777 Mariners Island Blvd #510 San Mateo, CA 94404 Phone: (650) 389-6000 Taxonomy Providers Smartlogic: Smartlogic is an enterprise taxonomy provider primarily known for Semaphore, a content intelligence platform prom- ising improved control of and easier access to an organization’s unstructured data. Smartlogic US 560 S. Winchester Blvd, Suite 500 San Jose, California, 95128 Phone: (408) 213-9500 Fax: (408) 572-5601 Synaptica: Synaptica is an innovation leader in the areas of enterprise taxonomy and metadata. Their platform integrates with Mi- crosoft SharePoint, providing a method to store Synaptica’s taxonomy within SharePoint, facilitating unstructured document search. Synaptica, LLC 11384 Pine Valley Drive Franktown, CO 80116 Phone: (303) 298-1947 dataversity.net 16 1
  • 19. Teragram: Recently acquired by SAS, Teragram is an expert in the world of linguistic search. They help organizations to better manage unstructured content in a variety of languages allowing an enterprise to further develop its international presence. SAS Institute Inc. 100 SAS Campus Drive Cary, NC 27513-2414 Phone: (919) 677-8000 Content Management Systems Documentum: Documentum remains one of the largest content management system (CMS) platforms in the industry. The software facilitates the management of business documents as well as a host of other unstructured data types, including images, audio, and video. Documentum is now owned by IT services conglomerate, EMC Corporation. EMC Corporation 176 South Street Hopkinton, MA 01748 Phone: (866) 438-3622 MarkLogic: MarkLogic makes enterprise software to help organizations manage unstructured data. Their system is based on the use of XQuery for fast retrieval of documents marked up with metadata in XML format, and thus scales nicely when accessing Big Data stores. MarkLogic Corporation Headquarters 999 Skyway Road, Suite 200 San Carlos, CA 94070 Phone: (877) 992-8885 WordPress: WordPress is one of the most popular open source blogging and content management platforms. A robust community has grown around the platform which leverages the MySQL and PHP open source solutions for database and scripting functionality. Drupal: Drupal is another open source content management platform, but without the self-blogging focus of WordPress. Discovery Systems Verity K2: K2 is an enterprise search platform, or discovery system, used by organizations wanting to provide intelligent searching of the mass of corporate unstructured data. Verity was acquired by Autonomy, which in turn was recently acquired by HP. Autonomy US Headquarters One Market Plaza Spear Tower, Suite 1900 San Francisco, CA 94105 Phone: (415) 243 9955 Serials Solutions’ Summon Service: The Summon service from Serial Solutions is a Web-scale discovery system used pri- marily by libraries. Summon provides search functionality on a full range of media, including audio, video, and e-content, in addition to books. dataversity.net 17 1
  • 20. Serial Solutions North America 501 North 34th Street Suite 300 Seattle, WA 98103-8645 Phone: (866) SERIALS (737-4257) EBSCO Discovery Service: EBSCO’s Discovery Service facilitates discovery of an institution’s resources by combining pre- indexed metadata from both internal and external sources to create a uniquely tailored search solution known for its speed. Although known for their database and e-book services, EBSCO’s Discovery Service primarily supports research institutions and libraries. EBSCO Publishing 10 Estes Street Ipswich, MA 01938 Phone: (800) 653-2726 (USA Canada) Fax: (978) 356-6565 OCLC WorldCat Local: WorldCat Local is a library-based discovery system provided by the Online Computer Library Center (OCLC). The system provides single search box access to over 922 million items from library collections worldwide. OCLC is also the organization responsible for first developing the Dublin Core Metadata Initiative. OCLC Headquarters 6565 Kilgour Place Dublin, OH 43017-3395 Phone: (614) 764-6000 Toll Free: (800) 848-5878 (USA and Canada only) Fax: (614) 764-6096 Ex Libris Primo Central: The Primo Central Index is the centerpiece of Ex Libris’ Primo Discovery and Delivery discovery system focused on providing access for the research scholar audience to hundreds of millions of documents. Ex Libris is a leading provider of automation solutions for the library sciences market. Ex Libris 1350 E Touhy Avenue, Suite 200 E Des Plaines, IL 60018 Phone: (847) 296-2200 Fax: (847) 296-5636 Toll Free: (800) 762-6300 Metadata Dublin Core Metadata Initiative: The Dublin Core Metadata Initiative is a metadata collection primarily used by libraries and education institutions worldwide. It was originally developed by the Online Computer Library Center (OCLC), located in Dublin OH. Esri ArcCatalog: Esri’s ArcCatalog is a tool within their ArcGIS software suite used for the development and management of GIS-related metadata. Esri is a worldwide leader in the management of geographic data. Esri Headquarters 380 New York Street Redlands, CA 92373-8100 Phone: (909) 793-2853 dataversity.net 18 1
  • 21. SchemaLogic MetaPoint: MetaPoint is a metadata tagging and management tool developed by SchemaLogic. Its primary audi- ence is companies with large investments in the Microsoft-based document tools, Office, and SharePoint. MetaPoint promises to pro- vide the missing connection between the two software products. SchemaLogic was recently acquired by taxonomy systems provider, Smartlogic. Smartlogic US 560 S. Winchester Blvd, Suite 500 San Jose, California, 95128 Phone: (408) 213-9500 Fax: (408) 572-5601 about dataversity We provide a centralized location for training, online webinars, certification, news and more for information technology (IT) pro- fessionals, executives and business managers worldwide. Members enjoy access to a deeper archive, leaders within the industry, knowledge base and discounts off many educational resources including webinars and data management conferences. For questions, feedback, ideas on future topics, or for more information please visit: http://www.dataversity.net/, or email: info@dataversity.net. dataversity.net 19