Linked Enterprise Data:
leveraging the Semantic Web stack
in a corporate IS environment
This paper was selected for and presented in the Industry Track at ISWC 2012 in Boston.
Fabrice Lacroix – Antidot - lacroix@antidot.net
The context
Business information systems (IS) have developed incrementally. Each new operating need has
generated an ad hoc application: ERP, CRM, EDM, directories, messaging, extranet and so on. IS
development has been driven by applications and processes, each new application creating another
data silo. Organizations are now facing a new challenge: how to manage and extract value from this
disparate, isolated data. Companies need an agile information system to deliver new applications at an
ever-increasing pace, developed from existing data, without creating a new warehouse or adding
complexity.
Over the past twenty years, various solutions attempting to tackle the problems raised by data
proliferation have appeared: BI, MDM, SOA. While these tools undoubtedly provide benefits, in most
cases they entail a long and costly deployment process and make the overall system even more complex.
What's more, none of them addresses the challenges of an ever-faster-changing technological
environment. A versatile IS should:
• Pool data to create information that will provide a new operational service,
• Integrate and distribute data between applications, both internally and externally with the
company's ecosystem,
• Provide an information infrastructure that emphasizes agility and ease of use.
Therefore, we need to look beyond the technological issues and change the paradigm. Instead of
focusing on applications, we must place the data at the heart of the approach. And for that, the recent
evolution of the Web blazes the trail.
Why we use the Semantic Web stack
Originally designed to serve as a universal document publication system, the Web has radically evolved
over the past 15 years. The Web of Data, also known as the Semantic Web, is the latest iteration of
the Web, in which computers can process and exchange information automatically and unambiguously.
It goes well beyond simple access to raw data by providing a way of interweaving semantized data.
This process, known as Linked Data, creates a decentralized knowledge base in which the value of
each piece of information is enhanced by its links to complementary data.
Being a software vendor in the realm of information access solutions (enterprise search engines, data
management and enrichment), Antidot has long been working on solutions that create a unified
informational space drawing on all of a company's documents and data, meshing unstructured and
structured information. In 2003, Antidot foresaw in Semantic Web technologies an elegant way to
tackle the challenge of enterprise data integration, and in 2006 we started evaluating and integrating
them into our solutions. Four years of development and several major projects with various customers
and business domains have allowed us to work out how to use these technologies efficiently. We strongly
support Linked Enterprise Data (LED), the application of the Linked Data principles to the corporate
IS [1].
However, the way we use the Semantic Web may seem heretical with respect to its conventional
principles. Below, we report the key aspects of our approach.
[1] For more information on our LED approach, read our white paper "Linked Enterprise Data –
Principles, Uses and Benefits" – http://bit.ly/LED-EN (PDF, 24 pages, 5.6 MB)
How we use the Semantic Web stack
The classical Semantic Web architecture for integrating data from various silos relies on a federated
principle: a query is synchronously distributed over the sources through SPARQL endpoints exposed by
each of them. This approach presents many scientific and technological challenges, but given the
rationale behind the Web of Data and the need to work at the gigantic scale of the open Web, it seems
to be the only reasonable way to make it work.
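For illustration, such a federated query could look like the following minimal Python sketch using the
SPARQLWrapper library (the endpoint URLs and the ns#billedTo predicate are hypothetical):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical setup: each silo (CRM, ERP) exposes its own SPARQL endpoint.
    endpoint = SPARQLWrapper("http://crm.example.com/sparql")
    endpoint.setQuery("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?person ?invoice WHERE {
            ?person foaf:name "Jane Doe" .
            # The SERVICE clause federates part of the query to a second silo.
            SERVICE <http://erp.example.com/sparql> {
                ?invoice <http://erp.example.com/ns#billedTo> ?person .
            }
        }
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()  # the engine joins results across silos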
Though theoretically correct, this approach is not applicable to the corporate IS, for a variety of
reasons:
• The corporate information system is built from numerous legacy or closed applications that
cannot be adapted or extended with SPARQL endpoints.
• Roughly 80% of the enterprise information realm consists of unstructured or semi-structured
data that does not fit the model as such.
• Enterprises do not want access to raw data in RDF format. They want to reap valuable
information derived from the data, which requires large and complex computations to create
these new informational objects.
• The bottom-up approach of mapping silos and their data to RDF to fit the model requires an
enormous amount of work to define vocabularies or ontologies for each source, which is too
heavy an investment.
• Companies dream of seamlessly integrating external data to leverage their internal
information. But this external data is mostly available in XML or JSON through Web Services,
and not yet in RDF, so using SPARQL to query and integrate it does not make sense.
• IT departments have invested heavily in their "relational database for storing / XML for
exchanging / Web apps for accessing" infrastructure. Their staff are trained in this paradigm
and lack the in-house skills to adopt a graph-oriented way of thinking.
• Stability matters most, and Semantic Web technology is unfamiliar, considered new and
immature: CIOs are not ready to take the risk of adding load and technological uncertainty to
systems that are critical to the company's daily business operations.
For all these reasons, the RDF-SPARQL paradigm as described above is not ready to enter the corporate
IS, and it may even face some resistance.
However, we think the Semantic Web is the solution for creating an agile information system. The way
we use it at Antidot is tightly related to the architecture of the data processing workflow we set up in
our projects. As a long-time vendor of information access solutions, we quickly came to the conclusion
that no search engine, whatever its technology, can be good if the data quality is not good enough.
To meet this need, we have developed Antidot Information Factory (AIF), a software solution
designed specifically to enrich and leverage structured and unstructured data. Antidot Information
Factory is an "information generator" that orchestrates large-scale processing of existing data and
automates publishing of enriched or newly created information.
The data processing workflows, named dataflows, always have the same pattern: Capture – Normalize
– Semantize – Enrich – Build – Expose.
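Schematically, a dataflow chains these steps into one pipeline. The following Python outline is only a
sketch with hypothetical stand-in stages, not the actual Information Factory API; the later stages are
detailed in the sections below:

    from typing import Any, Dict, List, Tuple

    Record = Dict[str, Any]
    Triple = Tuple[str, str, str]

    def capture(sources: List[List[Record]]) -> List[Record]:
        # Harvest raw records from every silo.
        return [record for source in sources for record in source]

    def normalize(records: List[Record]) -> List[Record]:
        # Align field contents so records from different silos can be meshed.
        return [{**r, "name": str(r.get("name", "")).strip().lower()} for r in records]

    def semantize(records: List[Record]) -> List[Triple]:
        # Cherry-pick fields and turn them into triples (see the Semantize step).
        return [(r["uri"], "foaf:name", r["name"]) for r in records]

    def run_dataflow(sources: List[List[Record]]) -> List[Triple]:
        graph = semantize(normalize(capture(sources)))
        # Enrich, Build and Expose would follow, as described below.
        return graph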
Capture and Normalize – These are the regular functions found in ETL systems: extract the data from
the sources, clean it and transform it. We tailor the Normalize step by aligning field contents in
order to mesh data coming from different sources (such as records from a CRM and an ERP). For
extracting records from relational databases and transforming selected records into RDF, we have
developed an R2RML and Direct Mapping compliant module named db2triples [2].
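For example, the W3C Direct Mapping turns each table row into one subject with one predicate per
column. The following Python sketch with rdflib reproduces these conventions by hand (db2triples
itself is a Java library; the "clients" table and its fields are hypothetical):

    from rdflib import RDF, Graph, Literal, Namespace, URIRef

    # Hypothetical table "clients" with primary key "id", mapped by hand
    # following the W3C Direct Mapping conventions.
    base = Namespace("http://data.mycompany.com/crm/")
    row = {"id": 42, "name": "Acme Corp", "country": "FR"}

    g = Graph()
    subject = URIRef(base + "clients/id=42")              # row IRI: table + primary key
    g.add((subject, RDF.type, URIRef(base + "clients")))  # the row is typed by its table
    for column, value in row.items():
        predicate = URIRef(base + "clients#" + column)    # one predicate per column
        g.add((subject, predicate, Literal(value)))

    print(g.serialize(format="turtle"))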
Semantize – This critical step is a cornerstone of our approach. We cherry-pick a subset of
interesting fields from each object and create their counterpart as RDF triples.
Generating the triples requires two actions:
URI generation – The URIs are generated according to a few principles. We chose the form of a
URL even though they are not directly dereferenceable: since the sources are not Semantic Web
compliant, our solution is in charge of maintaining a mapping that allows access to the real server.
The path contains the information necessary to access the record in the source system.
Example: a record extracted from the CRM will have a URI of the form
http://data.mycompany.com/crm/expr_id, where data.mycompany.com points to our solution,
crm is a nickname for the CRM source chosen during setup, and expr_id is an expression and/or
identifier that unambiguously points to the record and allows backtracking to the original data.
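A sketch of the corresponding URI factory (hypothetical helper code; only the URI scheme itself comes
from the description above):

    from urllib.parse import quote

    DATA_HOST = "http://data.mycompany.com"

    def make_uri(source_nickname: str, record_id: str) -> str:
        # e.g. make_uri("crm", "account/1337")
        #  -> "http://data.mycompany.com/crm/account%2F1337"
        return f"{DATA_HOST}/{source_nickname}/{quote(record_id, safe='')}"

    # Since the URIs are not directly dereferenceable, a side table maps each
    # generated URI back to the real record in the non-semantic source system.
    uri = make_uri("crm", "account/1337")
    uri_to_source = {uri: {"system": "crm", "record": "account/1337"}}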
Choosing the predicates – Experience has led us to conclude that "big ontologies" and
"upfront ontology design" must be avoided in enterprise projects. The idea is not to model or
describe each and every aspect of the processes and data inside the company, but to build up the
necessary information incrementally. Not to mention that in our approach, the graph is a means,
not an end. Therefore, we foster the use of existing vocabularies (such as DC, FOAF,
Organization, …) and mesh them as needed. When the enterprise has defined internal XML
formats, we reuse them by transforming tag and attribute names into triple predicates.
Pragmatism is the rule.
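For instance, a CRM contact can be described by meshing FOAF, DC and the W3C Organization
vocabulary (an rdflib sketch; the record values and URIs are hypothetical):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC, FOAF

    ORG = Namespace("http://www.w3.org/ns/org#")  # W3C Organization vocabulary
    g = Graph()

    contact = URIRef("http://data.mycompany.com/crm/contact_1337")
    g.add((contact, FOAF.name, Literal("Jane Doe")))
    g.add((contact, FOAF.mbox, URIRef("mailto:jane.doe@acme.example")))
    g.add((contact, ORG.memberOf, URIRef("http://data.mycompany.com/crm/acme")))
    g.add((contact, DC.source, Literal("crm")))  # keep track of the originating silo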
Unstructured documents such as office files, PDF files or email content do not fit the RDF formalism
and cannot be linked to the graph as such. Extra work is necessary:
First, we transform the available metadata into RDF: document name, author, creation date, the
sender, receivers and subject of an email, and so forth.
Then, we use text-mining technology to extract named entities such as people, organizations,
products, etc. from the documents. The entity lists are generated from different sources in the
enterprise: directories, the CRM or the ERP provide people and company names, while products
are listed in ERPs or taxonomies. Each annotation generates a triple whose subject is the
document URI, whose object is the entity URI, and whose predicate depends on the entity type
but mostly means "quotes" (doc_URI quotes entity_URI).
And last, we run various specific document-versus-document comparison algorithms to detect
duplicates, different versions of the same document, inclusions, semantically related documents,
etc. Each of these relations is inserted into the graph with an appropriate predicate.
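Put together, the triples attached to one document could look like the following (an rdflib sketch;
the URIs and the quotes and duplicateOf predicates are illustrative names, not a fixed vocabulary):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    MYV = Namespace("http://data.mycompany.com/vocab#")  # illustrative vocabulary
    g = Graph()

    doc = URIRef("http://data.mycompany.com/edm/doc_4511")
    # 1) Document metadata transformed into RDF.
    g.add((doc, DC.title, Literal("Q3 contract draft")))
    g.add((doc, DC.creator, URIRef("http://data.mycompany.com/dir/jdoe")))
    # 2) Named entities extracted by text mining, linked with a "quotes" predicate.
    g.add((doc, MYV.quotes, URIRef("http://data.mycompany.com/crm/acme")))
    # 3) Document-to-document relations found by the comparison algorithms.
    g.add((doc, MYV.duplicateOf, URIRef("http://data.mycompany.com/edm/doc_3980")))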
By doing so, and thanks to the syntactic alignment done at the Normalize step, we start linking data
together, mostly based on shared field values. This creates a first sparse graph.
But the key question here is: why do we transform only a subpart of the harvested data into RDF,
and what do we do with the rest? Indeed, beyond the fact that text documents are not graph-friendly,
as stated above we only transform a selected part of the structured data into RDF:
From a technical standpoint, we do not consider the technology mature and stable enough to
proceed otherwise. In industrial projects, millions of seed objects are regularly extracted from the
sources (invoices, clients, files, etc.), each having tens of fields, and the resulting billions of
triples do not scale well in the triplestores available today.
Transforming only a subpart of the data greatly simplifies the task of choosing the predicates,
and hence reinforces the choice of using many small existing vocabularies instead of big ontologies.
The data that is not transformed into RDF is stored by Information Factory for later use during
the Build step.
The very diversity of enterprise data implies a flexible and pragmatic strategy: the graph is only a part
of it.
[2] db2triples is compatible with the R2RML and Direct Mapping Recommendations of May 29th, 2012 and
has successfully passed the validation tests (http://www.w3.org/2001/sw/rdb2rdf/implementation-report/).
We have open-sourced it and made it available at http://github.com/antidot/db2triples.
Enrich – The next step is dedicated to enriching both the objects (the records captured from the
sources) and the graph. Depending on the content type and the project needs, we run various
algorithms such as text mining, classification, topic detection, etc. This complementary information
is included in the graph. We also integrate external information, either by importing data sets into
the graph or by querying external sources and mapping the results to RDF triples linked to the graph.
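As an illustration of the external-data case, the following sketch queries a hypothetical JSON web
service and maps the result to triples attached to an internal URI (the API, its fields and the
vocabulary are all assumptions):

    import requests
    from rdflib import Graph, Literal, Namespace, URIRef

    MYV = Namespace("http://data.mycompany.com/vocab#")  # illustrative vocabulary
    g = Graph()  # the internal graph being enriched

    # Hypothetical external web service returning a JSON company profile.
    profile = requests.get(
        "https://api.companyregistry.example/companies/FR123456").json()

    company = URIRef("http://data.mycompany.com/crm/acme")
    # Map selected JSON fields to RDF triples attached to the internal resource.
    g.add((company, MYV.legalForm, Literal(profile["legal_form"])))
    g.add((company, MYV.headcount, Literal(profile["employees"])))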
Build – Once the graph is decorated, the key step is to build the knowledge objects that are the real
target of the project. We start by executing inference rules (mostly SPARQL CONSTRUCT queries) in
order to saturate the graph. Then we extract those objects from the graph: this requires a mix of
SELECT queries plus dedicated graph traversal and sub-graph selection algorithms. Moreover, since we
have not transformed all the data into RDF nor transferred it to the graph, the objects we extract are
like skeletons that need to be complemented with the original data left outside the graph: this task
is completed through specific algorithms and tools embedded in Information Factory, designed to merge
the RDF objects extracted from the graph with the structured data and documents previously harvested.
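For instance, one saturation rule expressed as a SPARQL CONSTRUCT query can be executed with rdflib
and its output injected back into the graph (a sketch; the myv: vocabulary and the inference itself
are illustrative):

    from rdflib import Graph

    g = Graph()  # assume the enriched graph has been loaded or built here

    # Illustrative rule: if a document quotes a contact who belongs to a company,
    # relate the document to the company as well.
    rule = """
        PREFIX myv: <http://data.mycompany.com/vocab#>
        PREFIX org: <http://www.w3.org/ns/org#>
        CONSTRUCT { ?doc myv:concerns ?company }
        WHERE     { ?doc myv:quotes ?contact .
                    ?contact org:memberOf ?company }
    """
    for triple in g.query(rule):   # a CONSTRUCT query yields triples
        g.add(triple)              # saturate: inject inferred triples into the graph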
Expose – Finally, the knowledge objects are made available to the IS and to users in various ways,
depending on the environment and the needs. They can be dumped as XML files, injected into a database,
or indexed and made available through a semantic search engine following the Search-Based Application
(SBA) paradigm. Of course, these objects can also be loaded into a triplestore and made available in
native RDF format through a dedicated SPARQL endpoint.
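For the RDF-native outputs, a sketch with rdflib (the triplestore endpoints are hypothetical):

    from rdflib import Graph

    g = Graph()  # assume the built knowledge objects are loaded here

    # Dump to a file (RDF/XML here; Turtle or N-Triples work the same way).
    g.serialize(destination="knowledge_objects.rdf", format="xml")

    # Or push to a triplestore exposed through a SPARQL endpoint.
    store = Graph(store="SPARQLUpdateStore")
    store.open(("http://triplestore.mycompany.com/query",
                "http://triplestore.mycompany.com/update"))
    store += g  # upload the triples for native RDF access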
Hence, we have chosen Semantic Web technology for very pragmatic reasons:
• The RDF/OWL formalism is perfectly suited for modeling. Its graph nature fits our needs of
agility and flexibility. It fosters bottom-up small-scale projects that will offer a quick,
inexpensive response to an isolated business need. Each new project will gradually enlarge the
information graph, without the need to revise or overhaul the initial models.
• The Semantic Web benefits from an ecosystem of existing solutions and tools (triplestores,
inference engines, SPARQL endpoints, modeling tools, etc.), as well as partners and skills. The
open and standard nature of its formats and protocols guarantees investment sustainability:
the created data is always accessible and reusable, independently of the technology providers.
• The momentum around Open Data and Linked Data strengthens the credibility of the
approach. Though these concepts are not yet mainstream, CIOs are interested in evaluating the
technology and the benefits behind them, and they welcome the opportunity to extend this first
investment toward a full Semantic Web project in the future.
Conclusion
The Linked Enterprise Data strategy and the underlying Semantic Web standards represent a
comprehensive response to the challenge of creating an agile, high-performance information system.
Our approach has proven pragmatic and efficient, delivering the expected project agility and data
versatility. Our value proposition is not the technology itself: we offer to create valuable
information, in an agile way, for business needs.
CIOs do not yet express a need for the Semantic Web or Linked Data, and they have not planned to set
up a triplestore in their infrastructure. But we think the Semantic Web stack is the right tool. The
Linked Enterprise Data approach will prove its value even where we cannot convince our customers to
dive directly into a global Web of Data approach, and it may allow later projects to use Semantic Web
technologies more directly and openly.