The document discusses how to gain understanding from big data through effective data governance and classification. It argues that proper categorization of data using controlled vocabularies, taxonomies, and ontologies improves search, analytics and other uses of big data. A framework is presented outlining the key components of a data governance lifecycle for big data, including content creation, mining and classification, management of vocabularies/taxonomies/ontologies, and use of the structured data for search, transactions and analytics. Effective use of this framework can help organizations apply meaning and understanding to their big data.
2. How to make meaning out of Big Data
Big Data, as the poster child for marketing open-source software built on alternative database storage structures, has become a 'Big Nothing'. The ambiguity around what Big Data means requires endless hours of explanation, and the term really only names the problems of dealing with data of large volume, velocity, or variety (I'm waiting for catchier v's such as Victory or Value!). My perspective centers on the phrase 'Big Understanding', an optimistic 'View' of making sense of our data and turning it into information. The focus has to shift.
3. Classification = Relevance
No matter what vendors say, the better the classification and structure of your data, the better your search and analytical capabilities will be. Even tools that help with classification require custom rules and dictionaries, and they tend to be domain-specific. If you want high-quality Big Data, you need Data Governance.
4. Data Governance = Big Quality
If you want a high-quality analysis, your data has to be standardized and consistent. This is especially true when there is a large degree of variety in your inputs. For example, if each input source uses a different geopolitical hierarchy, you have to align them to a standard, or your customer won't find Colorado information when they type in CO (ok, a trivial example, but valid).
Many companies would benefit more from improving the quality and 'findability' of their data than from piling more data into an already inconsistent data store.
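As a minimal sketch of the standardization the Colorado example above calls for, the Python snippet below normalizes free-form state inputs against a controlled vocabulary so that 'CO' and 'Colorado' resolve to the same term. The STATE_VOCABULARY table and normalize_state() are hypothetical names invented for illustration, not anything from the deck.

    # A minimal sketch of aligning variant inputs to a controlled vocabulary.
    # The mapping table and function below are hypothetical examples.
    STATE_VOCABULARY = {
        "co": "Colorado",
        "colo.": "Colorado",
        "colorado": "Colorado",
        "ca": "California",
        "california": "California",
    }

    def normalize_state(raw: str) -> str:
        """Resolve a free-form state string to its governed vocabulary term."""
        key = raw.strip().lower()
        try:
            return STATE_VOCABULARY[key]
        except KeyError:
            # Unknown values are flagged for Subject Matter Expert review
            # rather than silently passed through.
            raise ValueError(f"Unrecognized state value: {raw!r}")

    assert normalize_state("CO") == normalize_state("Colorado")

The point is not the lookup table itself but the governance behind it: someone has to own the mapping, and unknown values should surface for review instead of leaking into the store.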
5. Data Governance Lifecycle
Applying Data Governance to Big Data helps you to:
- Understand the quality of your data
- Be able to categorize it into well-defined groupings, with commonly shared definitions
- Be able to look at new data and categorize it into new or existing groups
- Share it with your stakeholders
- Manage it over time
6. A Framework to gain perspective
The following slides attempt to provide a framework for understanding the information management lifecycle from the perspective of managing and applying meaning to your data.
7. Communication between People and Processes
[Diagram: the Data Governance Life Cycle. Content Creation & Sourcing and Content Mining & Classification exchange Content and VTO with VTO Management, where unstructured and structured content is paired with a governed Vocabulary, Taxonomy, and Ontology (VTO); the resulting Content + Governed VTO feeds Search, Transactions, and Analytics.]
8. Vocabulary, Taxonomy, and Ontology (VTO)
- Humans use systems of organization to make order of their world
- Effective experiences with Big Data are driven by Subject Matter Experts or machines categorizing content with a common language that can be shared and understood by consumers of the data
- Governed Vocabularies, Taxonomies, and Ontologies are the pick-lists, hierarchies, and relationships that define content, which Subject Matter Experts use to categorize, share, and analyze data (see the sketch below)
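One way to picture the pick-list / hierarchy / relationship distinction is with simple data structures. This is an illustrative sketch with invented terms, not a prescribed schema:

    # Illustrative only: a vocabulary is a flat pick-list of allowed terms...
    vocabulary = {"Laptop", "Desktop", "Tablet", "Smartphone"}

    # ...a taxonomy arranges those terms into a hierarchy...
    taxonomy = {
        "Hardware": {
            "Computers": ["Laptop", "Desktop"],
            "Mobile Devices": ["Tablet", "Smartphone"],
        }
    }

    # ...and an ontology adds typed relationships between terms.
    ontology = [
        ("Laptop", "is_a", "Computer"),
        ("Tablet", "runs", "Mobile OS"),
        ("Smartphone", "competes_with", "Tablet"),
    ]

Each level adds expressive power: the vocabulary constrains values, the taxonomy supports navigation and roll-ups, and the ontology supports reasoning across relationships.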
9. Content Creation and Sourcing
- Content is created by people interacting with computer systems as well as by machines generating data
- When you have more than one stream of data being produced by different inputs, the rules for categorization differ between systems
- Understanding your data sources, whether one system or many, requires knowing how the data is produced, and therefore how it can be analyzed
- Big Data promises that you don't need to know the meaning of your input data as you collect it; that doesn't mean you don't need to define and understand it before you begin to analyze it (see the ingestion sketch below)
- If you apply meaning and structure to your data, the quality of your analysis will improve, and some analyses only become possible at all
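To make the multi-stream point concrete, the sketch below maps two sources' differing category codes into one shared vocabulary at ingestion. The source names, codes, and mappings are all invented for illustration:

    # Hypothetical example: two sources categorize the same products
    # differently, so each stream gets its own mapping into the
    # shared vocabulary before the data lands in the store.
    SOURCE_MAPPINGS = {
        "crm": {"NB": "Laptop", "DT": "Desktop"},
        "warehouse": {"portable-pc": "Laptop", "tower-pc": "Desktop"},
    }

    def ingest(record: dict, source: str) -> dict:
        """Translate a source-specific category into the governed term."""
        mapping = SOURCE_MAPPINGS[source]
        record["category"] = mapping[record["category"]]
        return record

    row_a = ingest({"sku": 101, "category": "NB"}, source="crm")
    row_b = ingest({"sku": 202, "category": "portable-pc"}, source="warehouse")
    assert row_a["category"] == row_b["category"] == "Laptop"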
10. Content Mining and Classification
- Categorization of your data isn't a one-time event unless your analysis is a one-time event
- Subject Matter Experts need the ability to analyze new data, and to revisit old data to make sure nothing has changed
- Content Mining is a technique for bringing understanding to your data and how it fits into your view of the world
- Most Big Data platforms are weak (today) in this area
- For Big Data, there is a disconnect between the tooling vendors provide for analyzing data and the tooling for categorizing it and applying meaning (a stopgap sketch follows)
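Given that tooling gap, a dictionary-driven classifier along the lines below is one common stopgap, matching the slide 3 point that such tools need custom rules and dictionaries. The categories and terms are invented for illustration:

    # A minimal sketch of rule/dictionary-based content classification,
    # the kind of domain-specific mining the deck says platforms lack.
    CLASSIFICATION_RULES = {
        "Networking": {"router", "switch", "firewall"},
        "Storage": {"disk", "raid", "backup"},
    }

    def classify(text: str) -> list[str]:
        """Tag a document with every category whose dictionary terms appear."""
        words = set(text.lower().split())
        return [cat for cat, terms in CLASSIFICATION_RULES.items()
                if words & terms]

    print(classify("Replace the failed disk and verify the backup"))
    # ['Storage']

Because categorization isn't a one-time event, the rules dictionary itself is an artifact that Subject Matter Experts must revisit as new data arrives.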
11. VTO Management
- Vocabularies, Taxonomies, and Ontologies require management over time
- They are not built in isolation; they require collaboration between Subject Matter Experts and stakeholders
- They must be easily shared, versioned, and implemented against your data (a versioning sketch follows)
- Applying defined VTOs against Big Data is a challenge in current vendor offerings
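Versioning can start as simply as snapshotting the vocabulary per release and diffing versions so stakeholders can see what changed. A hypothetical sketch, with invented version numbers and terms:

    # Hypothetical sketch: version a vocabulary as immutable snapshots
    # and report what changed between releases.
    v1 = {"version": "1.0", "terms": {"Laptop", "Desktop"}}
    v2 = {"version": "1.1", "terms": {"Laptop", "Desktop", "Tablet"}}

    def diff_vocab(old: dict, new: dict) -> dict:
        """Summarize added and removed terms between two vocabulary versions."""
        return {
            "added": new["terms"] - old["terms"],
            "removed": old["terms"] - new["terms"],
        }

    print(diff_vocab(v1, v2))  # {'added': {'Tablet'}, 'removed': set()}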
12. Search, Transactions, Analytics
- Search: keyword or navigated searching through detailed or aggregated data
- Transactions: adding data to an existing store, via people or machines
- Analytics: statistics, probabilities, creating models ...
- Big, Medium, or Small data for each of these activities benefits from good categorization and application of VTO standards (see the illustration below)
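As a closing illustration, once records carry governed tags, the same structure serves both navigated search and simple analytics. The records and fields below are invented:

    # Hypothetical example: records tagged with governed vocabulary terms
    # support navigated search and aggregation with the same structure.
    from collections import Counter

    records = [
        {"id": 1, "state": "Colorado", "category": "Laptop"},
        {"id": 2, "state": "Colorado", "category": "Desktop"},
        {"id": 3, "state": "California", "category": "Laptop"},
    ]

    # Search: navigate by a governed facet value.
    colorado_hits = [r for r in records if r["state"] == "Colorado"]

    # Analytics: aggregate over the same governed facet.
    by_category = Counter(r["category"] for r in records)
    print(len(colorado_hits), by_category)
    # 2 Counter({'Laptop': 2, 'Desktop': 1})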
13. Conclusion
As Big Data continues to gain momentum in a confusing vendor marketplace, don't lose sight of the basics. Don't give in to unbounded promises of analyzing your data to perfection without considering the end goal of why you are collecting this data in the first place:
To apply meaning and understanding to the problem at hand, and to share it with people who can take fruitful action that results in improvement.