The document discusses how to gain understanding from big data through effective data governance and classification. It argues that proper categorization of data using controlled vocabularies, taxonomies, and ontologies improves search, analytics and other uses of big data. A framework is presented outlining the key components of a data governance lifecycle for big data, including content creation, mining and classification, management of vocabularies/taxonomies/ontologies, and use of the structured data for search, transactions and analytics. Effective use of this framework can help organizations apply meaning and understanding to their big data.
2. How to make meaning out of Big Data
Big Data, as the poster child for marketing open-source software built on alternative database storage structures, has become a 'Big Nothing'. The ambiguity around what Big Data means requires endless hours of explanation, and the term really only names the problems of dealing with data of large volume, velocity, or variety (I'm waiting for catchier v's such as Victory or Value!). My perspective centers on the phrase 'Big Understanding', an optimistic 'View' of making sense of our data and turning it into information. The focus has to shift.
3. Classification = Relevance
No matter what vendors say, the better the classification and structure of your data, the better your search and analytical capabilities will be. Even tools that help with classification require custom rules and dictionaries, and they tend to be domain-specific. If you want high-quality Big Data, you need Data Governance.
4. Data Governance = Big Quality
If you want a high-quality analysis, your data has to be standardized and consistent. This is especially true when there is a large degree of variety in your inputs. For example, if each input source uses a different geopolitical hierarchy, you have to align them to a standard, or your customer won't find Colorado information when they type in CO (ok, a trivial example, but valid).
Many companies would benefit more from improving the quality and 'findability' of their data than from piling more data into an already inconsistent data store.
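As a minimal sketch of the standardization the Colorado example above calls for, the Python snippet below normalizes free-form state inputs against a controlled vocabulary so that 'CO' and 'Colorado' resolve to the same term. The STATE_VOCABULARY table and normalize_state() are hypothetical names invented for illustration, not anything from the deck.

    # A minimal sketch of aligning variant inputs to a controlled vocabulary.
    # The mapping table and function below are hypothetical examples.
    STATE_VOCABULARY = {
        "co": "Colorado",
        "colo.": "Colorado",
        "colorado": "Colorado",
        "ca": "California",
        "california": "California",
    }

    def normalize_state(raw: str) -> str:
        """Resolve a free-form state string to its governed vocabulary term."""
        key = raw.strip().lower()
        try:
            return STATE_VOCABULARY[key]
        except KeyError:
            # Unknown values are flagged for Subject Matter Expert review
            # rather than silently passed through.
            raise ValueError(f"Unrecognized state value: {raw!r}")

    assert normalize_state("CO") == normalize_state("Colorado")

The point is not the lookup table itself but the governance behind it: someone has to own the mapping, and unknown values should surface for review instead of leaking into the store.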
5. Data Governance Lifecycle
Applying Data Governance to Big Data helps you to:
- Understand the quality of your data
- Be able to categorize it into well-defined groupings, with commonly shared definitions
- Be able to look at new data and categorize it into new or existing groups
- Share it with your stakeholders
- Manage it over time
6. A Framework to gain perspective
The following slides attempt to provide a framework for understanding the information management lifecycle from the perspective of managing and applying meaning to your data.
7. Communication between People and Processes
[Diagram: the Data Governance Life Cycle. Content Creation & Sourcing and Content Mining & Classification exchange Content and VTO with VTO Management, where unstructured and structured content is paired with a governed Vocabulary, Taxonomy, and Ontology (VTO); the resulting Content + Governed VTO feeds Search, Transactions, and Analytics.]
8. Vocabulary, Taxonomy, and Ontology (VTO)
- Humans use systems of organization to make order of their world
- Effective experiences with Big Data are driven by Subject Matter Experts or machines categorizing content with a common language that can be shared and understood by consumers of the data
- Governed Vocabularies, Taxonomies, and Ontologies are the pick-lists, hierarchies, and relationships that define content, which Subject Matter Experts use to categorize, share, and analyze data (see the sketch below)
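One way to picture the pick-list / hierarchy / relationship distinction is with simple data structures. This is an illustrative sketch with invented terms, not a prescribed schema:

    # Illustrative only: a vocabulary is a flat pick-list of allowed terms...
    vocabulary = {"Laptop", "Desktop", "Tablet", "Smartphone"}

    # ...a taxonomy arranges those terms into a hierarchy...
    taxonomy = {
        "Hardware": {
            "Computers": ["Laptop", "Desktop"],
            "Mobile Devices": ["Tablet", "Smartphone"],
        }
    }

    # ...and an ontology adds typed relationships between terms.
    ontology = [
        ("Laptop", "is_a", "Computer"),
        ("Tablet", "runs", "Mobile OS"),
        ("Smartphone", "competes_with", "Tablet"),
    ]

Each level adds expressive power: the vocabulary constrains values, the taxonomy supports navigation and roll-ups, and the ontology supports reasoning across relationships.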
9. Content Creation and Sourcing
- Content is created by people interacting with computer systems as well as by machines generating data
- When you have more than one stream of data being produced by different inputs, the rules for categorization differ between systems
- Understanding your data sources, whether one system or many, requires knowing how the data is produced, and therefore how it can be analyzed
- Big Data promises that you don't need to know the meaning of your input data as you collect it; that doesn't mean you don't need to define and understand it before you begin to analyze it (see the ingestion sketch below)
- If you apply meaning and structure to your data, the quality of your analysis will improve, and some analyses only become possible at all
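To make the multi-stream point concrete, the sketch below maps two sources' differing category codes into one shared vocabulary at ingestion. The source names, codes, and mappings are all invented for illustration:

    # Hypothetical example: two sources categorize the same products
    # differently, so each stream gets its own mapping into the
    # shared vocabulary before the data lands in the store.
    SOURCE_MAPPINGS = {
        "crm": {"NB": "Laptop", "DT": "Desktop"},
        "warehouse": {"portable-pc": "Laptop", "tower-pc": "Desktop"},
    }

    def ingest(record: dict, source: str) -> dict:
        """Translate a source-specific category into the governed term."""
        mapping = SOURCE_MAPPINGS[source]
        record["category"] = mapping[record["category"]]
        return record

    row_a = ingest({"sku": 101, "category": "NB"}, source="crm")
    row_b = ingest({"sku": 202, "category": "portable-pc"}, source="warehouse")
    assert row_a["category"] == row_b["category"] == "Laptop"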
10. Content Mining and Classification
- Categorization of your data isn't a one-time event unless your analysis is a one-time event
- Subject Matter Experts need the ability to analyze new data, and to revisit old data to make sure nothing has changed
- Content Mining is a technique for bringing understanding to your data and how it fits into your view of the world
- Most Big Data platforms are weak (today) in this area
- For Big Data, there is a disconnect between the tooling vendors provide for analyzing data and the tooling for categorizing it and applying meaning (a stopgap sketch follows)
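Given that tooling gap, a dictionary-driven classifier along the lines below is one common stopgap, matching the slide 3 point that such tools need custom rules and dictionaries. The categories and terms are invented for illustration:

    # A minimal sketch of rule/dictionary-based content classification,
    # the kind of domain-specific mining the deck says platforms lack.
    CLASSIFICATION_RULES = {
        "Networking": {"router", "switch", "firewall"},
        "Storage": {"disk", "raid", "backup"},
    }

    def classify(text: str) -> list[str]:
        """Tag a document with every category whose dictionary terms appear."""
        words = set(text.lower().split())
        return [cat for cat, terms in CLASSIFICATION_RULES.items()
                if words & terms]

    print(classify("Replace the failed disk and verify the backup"))
    # ['Storage']

Because categorization isn't a one-time event, the rules dictionary itself is an artifact that Subject Matter Experts must revisit as new data arrives.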
11. VTO Management
- Vocabularies, Taxonomies, and Ontologies require management over time
- They are not built in isolation; they require collaboration between Subject Matter Experts and stakeholders
- They must be easily shared, versioned, and implemented against your data (a versioning sketch follows)
- Applying defined VTOs against Big Data is a challenge in current vendor offerings
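Versioning can start as simply as snapshotting the vocabulary per release and diffing versions so stakeholders can see what changed. A hypothetical sketch, with invented version numbers and terms:

    # Hypothetical sketch: version a vocabulary as immutable snapshots
    # and report what changed between releases.
    v1 = {"version": "1.0", "terms": {"Laptop", "Desktop"}}
    v2 = {"version": "1.1", "terms": {"Laptop", "Desktop", "Tablet"}}

    def diff_vocab(old: dict, new: dict) -> dict:
        """Summarize added and removed terms between two vocabulary versions."""
        return {
            "added": new["terms"] - old["terms"],
            "removed": old["terms"] - new["terms"],
        }

    print(diff_vocab(v1, v2))  # {'added': {'Tablet'}, 'removed': set()}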
12. Search, Transactions, Analytics
- Search: keyword or navigated searching through detailed or aggregated data
- Transactions: adding data to an existing store, via people or machines
- Analytics: statistics, probabilities, creating models ...
- Big, Medium, or Small data for each of these activities benefits from good categorization and application of VTO standards (see the illustration below)
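As a closing illustration, once records carry governed tags, the same structure serves both navigated search and simple analytics. The records and fields below are invented:

    # Hypothetical example: records tagged with governed vocabulary terms
    # support navigated search and aggregation with the same structure.
    from collections import Counter

    records = [
        {"id": 1, "state": "Colorado", "category": "Laptop"},
        {"id": 2, "state": "Colorado", "category": "Desktop"},
        {"id": 3, "state": "California", "category": "Laptop"},
    ]

    # Search: navigate by a governed facet value.
    colorado_hits = [r for r in records if r["state"] == "Colorado"]

    # Analytics: aggregate over the same governed facet.
    by_category = Counter(r["category"] for r in records)
    print(len(colorado_hits), by_category)
    # 2 Counter({'Laptop': 2, 'Desktop': 1})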
13. Conclusion
As Big Data continues to gain momentum in a confusing vendor marketplace, don't lose sight of the basics. Don't give in to unbounded promises of analyzing your data to perfection without considering the end goal of why you are collecting this data in the first place:
To apply meaning and understanding to the problem at hand, and to share it with people who can take fruitful action that results in improvement.