SlideShare uma empresa Scribd logo
1 de 93
Access Innovations, Inc..

12/11/2013
Marjorie M.K. Hlava
President
Access Innovations, Inc.
Mhlava@accessinn.com

12/11/2013
Access Innovations, Inc..

Big Data Content
Organization, Discovery,
and Management
Access Innovations, Inc..

• Big Data
• New Government Initiative
• Content Organization
• Discovery (Search)
• Management
• Skills we bring
• Examples of what we can do

12/11/2013

Outline
Access Innovations, Inc..

• Data is the new oil – we have to learn how to
mine it! Qatar – European Commission Report
• $ 7 trillion economic value in 7 US sectors alone
• $90 B annually in sensitive devices
• An insurance firm with 5 terabytes of data on
share drives pays $1.5 m per year
• New McKinsey 4th factor of production
• Land, Labor, Capital, + Data

12/11/2013

Why do we care about Big Data?
Access Innovations, Inc..

12/11/2013

The Data Deluge – Wired 16.07
Access Innovations, Inc..

• Google, eBay, LinkedIn, and Facebook were built
around Big Data from the beginning.
• No need to reconcile or integrate Big Data with
more traditional sources of data and the
analytics performed upon them
• No merging Big Data technologies with their
traditional IT infrastructures
• Big Data could stand alone, Big Data analytics
could be the only focus of analytics
• Big Data technology architectures could be the
only architecture.

12/11/2013

Big Data Born
Access Innovations, Inc..

• Large, well-established orgs
• Must be integrated with everything else that’s
going on in the company.
• Analytics on Big Data have to coexist with
analytics on other types of data.
• Hadoop clusters have to do their work alongside
IBM mainframes.
• Data scientists must somehow get along and
work jointly with mere quantitative analysts.

12/11/2013

Integrating Big Data
Access Innovations, Inc..

• Big Data is a term applied to data sets whose size
is beyond the ability of commonly used software
tools to capture, manage, and process the data
within a tolerable elapsed time. Big Data sizes
are a constantly moving target currently ranging
from a few dozen terabytes to many petabytes of
data in a single data set. – Wikipedia, May 2011
• “Unstructured”
• Terabytes, petabytes, zettabytes
• Streaming

12/11/2013

What is Big Data?
Access Innovations, Inc..

12/11/2013

New kind of science?
Access Innovations, Inc..

• Volume, Velocity, Variety
• Ability to deal overwhelmed
• More about methods than data
• Location aware data
• Life streaming
• Insurance claims
• Hubble telescope
• CERN Collections
• Flight data

12/11/2013

New Special Collections
Access Innovations, Inc..

12/11/2013

www.researchobject.org
Access Innovations, Inc..

• Means untagged or unformatted
• PDF
• Word files
• File shares
• News feeds
• News Data feeds
• Images

12/11/2013

Unstructured data
Access Innovations, Inc..

• All data has some structure and more structure
possibilities
• PDF properties
• Word file properties
• File structures

12/11/2013

Bit of a misnomer
Access Innovations, Inc..

• PDF - property tables
• Word files – property tables
• Files shares – implied structure in file names
• News feeds – headers have metadata
• Hubbell telescope feeds
• Images

12/11/2013

Property tables
Access Innovations, Inc..

12/11/2013

PDF Properties
Access Innovations, Inc..

12/11/2013

File structures
12/11/2013

• XML tagged data
Access Innovations, Inc..

Structured data
Access Innovations, Inc..

• Data infrastructure challenges
• “taking diverse and heterogeneous data sets and
making them more homogeneous and usable”
• An opportunity?
• All that data – what can it tell us?
• Privacy
• Copyright
• Neurological impact
• Data collection methods

12/11/2013

What are the problems?
Access Innovations, Inc..

The Big Data Senior Steering Group (BDSSG) was
formed to identify current Big Data research and
development activities across the Federal
government, offer opportunities for coordination,
and identify what the goal of a national initiative
in this area would look like. Subsequently, in
March 2012, The White House Big Data R&D
Initiative was launched and the BDSSG continues
to work in four main areas to facilitate and further
the goals of the Initiative.

12/11/2013

New Government Initiative
Access Innovations, Inc..

• Fast-growing volume of digital data of digital
data
• Advance state-of-the-art core technologies
needed to collect, store, preserve, manage,
analyze, and share huge quantities of data.
• Harness these technologies to accelerate the
pace of discovery in science and engineering,
strengthen our national security, and transform
teaching and learning; and
• Expand the workforce needed to develop and
use Big Data technologies.

12/11/2013

The National Big Data R&D Initiative
•Big Data
•Data analytics;

•Educate and expand the Big Data workforce
•Improve key outcomes in economic growth,
job creation, education, health, energy,
sustainability, public safety, advanced
manufacturing, science and engineering, and
global development

Access Innovations, Inc..

•Advance supporting technologies

12/11/2013

Data to Knowledge to Action
Access Innovations, Inc..

• January 22, 2013 - Data on Data: Presenting
Stakeholder Alignment Data on the
Cyberinfrastructure for Earth System Science
• Presentation and discussion with Professor Joel
Cutcher-Gershenfeld. Professor CutcherGershenfeld presented information on the NSF
EarthCube initiative including stakeholder survey
data (approximately 850 responses).

12/11/2013

Data on Data
Access Innovations, Inc..

12/11/2013

Who is involved?
•
•
•
•
•
•

INTERAGENCY WORKING GROUPS
Cyber Security and Information Assurance
High End Computing
COMMUNITY OF PRACTICE (CoP)
· Faster Administration of Science and Technology
Education and Research
COORDINATING GROUPS
Human Computer Interaction and Information
Management
High Confidence Software and Systems
Large Scale Networking
Software Design and Productivity
Social, Economic, and Workforce Implications of IT

Access Innovations, Inc..

•
•
•
•
•

12/11/2013

Groups breakdown
Access Innovations, Inc..

• Big Data
• Cyber Physical Systems
• Cyber Security and Information Assurance Research and
Development
• Health Information Technology Research and
Development
• Wireless Spectrum Research and Development
• SUBGROUP
• Health Information Technology Innovation and
Development Environments Subgroup
• TEAMS
• Joint Engineering Team
• Middleware and Grid Interagency Coordinating Team

12/11/2013

SENIOR STEERING GROUPS (SSGs)
•
•
•
•

Local
Cloud
Remote
Streaming

• Undifferentiated
• Unstructured
• Needs organization
• Type of database structure
• RDBMS
• Object oriented

Access Innovations, Inc..

• Data on machines

12/11/2013

Content Organization
Access Innovations, Inc..

12/11/2013

RDBMS Connection

Taxonomy
term table
Public
repositories

Commercial
data sources

Agency data
sources

Search engine

Library
catalogs

Locally held
documents

Search engine

Search engine

Search engine

Search engine

spiders

Filtered
content

Search engine

Meta-Search Tool
TAXONOMY

Web portal

12/11/2013

INTERNET
(public)

Access Innovations, Inc..

Data feeds
Structure unstructured data
Create metadata
Where to put it?
Store metadata with the records
• HTML header
• Properties tables
• XML files

• Store metadata in a separate file
•
•
•
•
•
•

Database
Metadata repository
Search system
File structure
SharePoint application
Web interface

Access Innovations, Inc..

•
•
•
•

12/11/2013

Metadata options
• Uncontrolled list
• Name authority file
• Synonym set / ring
• Controlled vocabulary
• Taxonomy
• Thesaurus
• Ontology
Highly complex - $$$$
• Topic Map
LOTS OF OVERLAP!
• Semantic Network

12/11/2013

Not complex - $

Access Innovations, Inc..

Information retrieval starts with a
knowledge organization system
Synonyms

Taxonomy

Thesaurus

INCREASING COMPLEXITY / RICHNESS

Ambiguity control
Synonym control

Ambiguity control Ambiguity cont’l
Synonym control Synonym cont’l
Hierarchical rel’s Hierarchical rel’s
Associative rel’s

12/11/2013

List of words

Access Innovations, Inc..

Structure of
controlled vocabularies
TAXONOMY
ONTOLOGY

• See also (SA)

• Non-Preferred Term (NP)
• Used for (UF), See (S)

• Scope Note (SN)
• History (H)
THESAURUS

*a.k.a. subject term, heading,
node, index term, keyword,
category, descriptor, class

12/11/2013

• Main Term (MT) *
• Top Term (TT)
• Broader Terms (BT)
• Narrower Terms (NT)
• Narrower Term Instance
• Related Terms (RT)

Access Innovations, Inc..

Taxonomy / thesaurus
12/11/2013

Thesaurus
Term Record
view

Access Innovations, Inc..

Taxonomy
view
Reference: “Taxonomies: the vital tool of
information architecture”, www.tfpl.com.
Courtesy of Lillian Gassie, Naval
Postgraduate School, Monterey, CA

Access Innovations, Inc..

A taxonomy aspires to be:
• a correlation of the different functional, regional
and (possibly) national languages used by a
community of practice
• a support mechanism for navigation
• a support tool for search engines and knowledge
maps
• an authority for tagging documents and other
information objects
• a knowledge base in its own right

12/11/2013

Taxonomies in Context
Access Innovations, Inc..

• Link the taxonomy and indexing
• Always in sync with the industry
• Keep up to date with terminology
• Automatically index the old data
• Filter newsfeeds
• Search using the taxonomy
• File using the taxonomy
• Spell check using the taxonomy
• Link to translation system
• Catalog using the taxonomy

12/11/2013

Where to use a taxonomy
Courtesy of Lillian Gassie, Naval
Postgraduate
School, Monterey, CA

“Taxonomies are better created by
professional indexers or librarians
than by domain experts.”

12/11/2013

• Improves organization & structure
• Facilitates navigation
• Facilitates knowledge discovery
• Reduces effort
• Saves time

Access Innovations, Inc..

Value of a Taxonomy
• knowledge management system
• search retrieval system
• portal interface

Access Innovations, Inc..

• A well formed taxonomy is based on a thesaurus
• Provides a flexible platform for many views of
the taxonomy
• Allows fast deployment
• Is the basis for a good

12/11/2013

Taxonomies
Database
Management Establish rules for
term use
System Suggest indexing
terms

Thesaurus
tool

Indexing
tool

Validate terms
Add terms and rules
Change terms and rules
Delete terms and rules

Access Innovations, Inc..

Search thesaurus
Validate term entry
Block invalid terms
Record candidates

12/11/2013

A quick look behind the scenes
12/11/2013

Thesaurus
Term Record
view

Access Innovations, Inc..

Taxonomy
view
Suggested taxonomy descriptors
Access Innovations, Inc..

12/11/2013
Access Innovations, Inc..

• Create the metadata structure
• Gather the locations of the records
• Index in place?
• Point to the data?
• Add metadata to the record?
• Store metadata in separate “table”?
• Use in full text?
• Use in search?
• Link databases and data sets with APIs

12/11/2013

Workflow order
• Was: Card catalog or OPACs to books on shelf
• Now: EBSCO host to Mendeley, Zotero, ResearchGate,
Academic.edu, etc.
• Use an API or web call

• Web calls
• Call to another web platform

• APIs, written handshakes
• Z39.50 is one standard for libraries

Access Innovations, Inc..

• Link data cross platforms
• Not federated search
• Examples

12/11/2013

APIs and Web Calls
• Free text / full text
• Fielded / Faceted

• Parts of Search
• Presentation layer
• Caching
• Inverted index

Access Innovations, Inc..

• Search

12/11/2013

Discovery (Search)
• Autonomy / Verity

• Bayesian –
• FAST
• Lucene

• Faceted
• Endeca

• Boolean
• Dialog

• Ranking Algorithms
• Google

Access Innovations, Inc..

• Keyword –

12/11/2013

Kinds of search
Access Innovations, Inc..

• Query language
• Search technology
• Inverted index
• Ranking algorithms
• Other enhancements

12/11/2013

Parts of search
FAST ESP and Data Harmony
12/11/2013

Architecture
Core Architectural Components

MANAGEMENT API

Custom
Applications

Content
Push

FILTER
SERVER

CUSTOM
CONNECTOR
MAIstro

Agent DB

Data Harmony Governance API

Query

Alerts

Vertical
Applications
Portals

Results

NavTree & MAI Query

LOTUS NOTES
CONNECTOR

Email,
Groupware

Index DB

Pipeline

QUERY
PROCESSOR

Databases

DATABASE
CONNECTOR

Pipeline

DOCUMENT
PROCESSOR

FILE
TRAVERSER

CONTENT API

Files,
Documents

SEARCH
SERVER

QUERY API

WEB
CRAWLER

Web
Content

Custom
Front-Ends
Mobile
Devices

Access Innovations, Inc..

Administrator’s
Dashboard
Access Innovations, Inc..

12/11/2013

SLA search

Interpret search word
“competencies”
as taxonomy term
Professional
competencies
12/11/2013
Access Innovations, Inc..

Search “taxonomy”
in XML descriptor field
returns all documents
in that category  27
Search in original
metadata  1
Solution:
Include descriptors
with metadata!
Access Innovations, Inc..

• Data management layers
• Curation
• Preservation
• Archiving
• Storage
• Choose tools wisely
• Data mashups

12/11/2013

Management
• Sorting for production

• Level of effort to get to 85%
• Integration into the workflow is efficient

Access Innovations, Inc..

• Reach 85% precision to launch for productivity assisted
• Reach 85% for filtering or categorization

12/11/2013

Go – No Go
Access Innovations, Inc..

• Information infrastructure
• Information dynamics
• Tensions on data
• Design elements
• How and when to set the data framework
• Identifiers (taxonomies) Glue to hold data
together

12/11/2013

Learn to understand the parts
Access Innovations, Inc..

• Core cross disciplines
• Combo of humans and machines
• What is open
• What needs a gatekeeper
• Store locally
• Store in cloud
• Some things people do better

12/11/2013

Services
• Satisfice = satisfaction + suffice
• How good is good enough?
• How much error can you put up with?

• Hits, misses, noise

Access Innovations, Inc..

• 15 – 20% irrelevant returns / noise
• Amount of work needed to achieve 85% level
• How good is good enough?

12/11/2013

Benchmarks
• 1.5 synonyms per terms
• 7500 terms total

• Assume 85% accuracy
• Use assisted for indexing
• Use automatically for filtering

• Assume $55 per hour for staff
• Assume 10000 records for test batch

Access Innovations, Inc..

• Assume – 5000 term thesaurus

12/11/2013

ROI Calculations
•
•
•
•
•
•
•
•
•

Preferred and non-preferred terms
Find records – could be programmatic
Need to collect and review about 60 to get 20 appropriate ones
Review records – ensure they are correct
3 minutes average per final selected record
= 1 hour per term entry
1 minute to review a record (20 resulting records)
7500 terms at one hour per term = 7500 hours
Estimated at 7500 hours $55 / hour = $412,500

Access Innovations, Inc..

• Collect 20 items per thesaurus entry

12/11/2013

Co-occurrence - Training sets
•
•
•
•

Average rule takes 5 minutes
(Actually MUCH faster using a rule building tool)
5 x 1500 = 7500 minutes
125 hours x $55 = $6875

Access Innovations, Inc..

• 80% of rules built automatically
• 7500 x .8 = 6000
• 20% require complex rules

12/11/2013

Rulebuilding - rules
Access Innovations, Inc..

• Cost of auto system
• Cost of getting system ready
• Ongoing maintenance
• Increased efficiency
• Increased quality of retrieval
• Cost of legacy system maintenance

12/11/2013

ROI - Segments
• Cost of auto system- $500,000 (could be less, but the one used for this study
cost this amount)
• Cost of getting system ready

12/11/2013

ROI – Co-occurrence
• Estimated at 2 weeks programming $100 / hour = $8000

• Training sets
• Estimated at 7500 hours $55 / hour = $412,500
• Possible need to re-run training set several times

• Ongoing maintenance
• Estimated at 15% of purchase price for license = $75,000
• Quarterly retraining to keep up with new terms –
• Training sets for new terms 50 terms per quarter = 200 x $55=$11,000

•
•
•
•

Increased efficiency
Expected accuracy at 60%
Increased quality of retrieval
Cost of legacy system maintenance

Access Innovations, Inc..

• Programming support and integration
• Cost of auto system- $60,000
• Cost of getting system ready

12/11/2013

ROI – Rules system
• Programming support and integration
• Rule building
• Estimated at 125 hours at $55 / hour = $6850
• Possible need to re run training set several times

• Ongoing maintenance
• Estimated at 15% of purchase price for license = $9000
• Rule building for new terms, 50 terms per quarter
• 200 terms x .8 = 160 automatic
• 40 at 5 minutes per term = 200 minutes /60 = 3.33 hours x $55 = $183

•
•
•
•

Increased efficiency
Expected accuracy at 60%
Increased quality of retrieval
Cost of legacy system maintenance

Access Innovations, Inc..

• Estimated at 2 weeks programming at $100 / hour = $8000
• Year one

12/11/2013

ROI - Rules
• Years thereafter
• $9000 + $183 = $9183

• 85% accuracy

Access Innovations, Inc..

• $60,000 + $8,000 + $6,850 = $74,850
• Year one

12/11/2013

ROI – Co-occurrence
• Years thereafter
• $75000 + $11,000 = $86,000

• 60% accuracy

Access Innovations, Inc..

• $500,000 + $8,000 + $412,500 = $920,500
• Controlled – use rules
• New terms - use inference

Access Innovations, Inc..

• Novelty detection requires inference techniques
• Is needed to discover new terms
• Controlled vocabulary / thesaurus can be more
controlled and accurate approach
• Combination of the two would be best

12/11/2013

Summary on ROI
Client Data
Full Text

Tag and
Create
metadata

Put in
data base
with tags

Build
Search
inverted index

Automatic
Summarization

Search
Presentation
Layer

HTML, PDF,
Data Feeds, etc.

Machine Aided
Indexer (M.A.I.™)

Inline Tagging

Client
Client Taxonomy
taxonomy

Metadata and
Entity Extractor
Thesaurus Master

Create
user
interface

Access Innovations, Inc..

Gather
source
data

12/11/2013

Fully integrated with MOSS
The Workflow

Database
Repository

Search
Software

Increases
accuracy
Browse by Subject
Auto-completion
Broader Terms
Narrower Terms
Related Terms
Access Innovations, Inc..

12/11/2013

Inline Tagging

Shows the exact point where the concept
is mentioned
Mouse-over to view the term record
Statistical summary, showing the number
of times each term is mentioned in the
article
Automatic
Summarization

Full text, HTML,
PDF, data feeds

Inline Tagging

Metadata
Extractor

Apply terms
Rules Base

Thesaurus
Master

Client
Taxonomy

Data
Repository

Search
Software

User Interface
Web Portal

Access Innovations, Inc..

12/11/2013

Semantic Process
Raw Full
text data
feeds
Printed
source
materials
Data
Crawls on
data
sources

Source data

XIS Creation

XIS
repository

Load to
Search

Taxonomy
terms
MAI Concept
Extractor
MAI Rule
Base
Taxonomy
Thesaurus
Master
Add
metadata

Clean and enhance data

Search
Harmony
Display
Search

Search data

Access Innovations, Inc..

SQL for
ecommerce

12/11/2013

Database Plus Search Workflow
Creating
an
Inverted
File Index

1 Define key terminology
2 Thesaurus tools

• Features
• Functions
3 Costs

• Thesaurus construction
• Thesaurus tools
4 Why & when?

Access Innovations, Inc..

Outline of Presentation

12/11/2013

Sample DOCUMENT
&
1
2
3
4
construction
costs
define
features
functions

key
of
outline
presentation
terminology
thesaurus
tools
when
why

12/11/2013
Access Innovations, Inc..

Simple inverted file index
The terms from the “outline”
& - Stop
1 - Stop
2 - Stop
3 - Stop
4 - Stop
construction - L7, P2, SH
costs - L6, P1, H
define - L2, P1, H
features - L4, P1, SH
functions - L5, P1, SH

12/11/2013

key - L2, P2, H
of - Stop
outline - L1, P1, T
presentation - L1, P3, T
terminology - L2, P3, H
thesaurus - (1) - L3, P1, H
(2) - L7, P1, SH
(3) - L8, P1, SH
tools - (1) - L3, P2, H
(2) - L8, P2, SH
when - L9, P3, H
why - L9, P1, H

Access Innovations, Inc..

Complex inverted file index
Placement location
12/11/2013

Complex Search Farm
Query

Search Harmony
Presentation
Layer

Cleanup, etc.

Federators

Deploy
Hub

Repository XIS
(cache)

Cache
Builders
Source
Data

Access Innovations, Inc..

Query Servers

Index
Builders
• That is what librarians usually look at.
• What are the options coming?
• How can we encourage useful changes?

12/11/2013
Access Innovations, Inc..

What happens at the search
presentation layer?
Access Innovations, Inc..

12/11/2013

Semantic search options
Access Innovations, Inc..

12/11/2013
Access Innovations, Inc..

12/11/2013
Journal
Article on
Topic A

Grant Available for
Researchers Working
on Topic A

Author Networks
Social Networking

Based on an image by Helen Atkins

Upcoming
Conference on
Topic A

Job Posting
for Expert
on Topic A

Podcast Interview
with Researcher
Working on Topic A

Access Innovations, Inc..

Other Journal
Articles on
Topic A

CME Activity
on Topic A

12/11/2013

Data linked and served by metadata
Alcohol, Folate, Methionine, and Risk of Incident Breast Cancer in
the American Cancer Society Cancer Prevention Study II Nutrition
Cohort
Heather Spencer Feigelson1, Carolyn R. Jonas, Andreas S. Robertson,
Marjorie L. McCullough, Michael J. Thun and Eugenia E. Calle
Department of Epidemiology and Surveillance Research, American
Cancer Society, National Home Office, Atlanta, Georgia 30329-4251
Recent studies suggest that the increased risk of breast cancer
associated with alcohol consumption may be reduced by adequate
folate intake. We examined this question among 66,561
postmenopausal women in the American Cancer Society Cancer
Prevention Study II Nutrition Cohort. Think Tank Report
Related Working Groups
•Finance
Related Think Tank Report Content
•Charter
•Molecular Epidemiology
Webcasts
Related Awards
Related Webcasts
•AACR-GlaxoSmithKline Clinical Cancer Research Scholar
Awards
•ACS Award
•Weinstein Distinguished Lecture

Related Press Releases
12/11/2013

•How What and How Much We Eat (And Drink) Affects Our Risk of
Cancer
•Novel COX-2 Combination Treatment May Reduce Colon Cancer Risk
Combination Regimen of COX-2 Inhibitor and Fish Oil Causes Cell
Death
•COX-2 Levels Are Elevated in Smokers

Related AACR Workshops and Conferences
•Frontiers in Cancer Prevention Research
•Continuing Medical Education (CME)
•Molecular Targets and Cancer Therapeutics
Related Meeting Abstracts

Access Innovations, Inc..

Cancer Epidemiology Biomarkers & Prevention
Vol. 12, 161-164,
February 2003
© 2003 American Association for Cancer Research
Short Communications

•Association between dietary folate intake, alcohol intake, and
methylenetetrahydrofolate reductase C677T and A1298C
polymorphisms and subsequent breast
•Folate, folate cofactor, and alcohol intakes and risk for colorectal
adenoma
•Dietary folate intake and risk of prostate cancer in a large prospective
cohort study

Related Education Book Content
Oral Contraceptives, Postmenopausal Hormones, and
Breast Cancer
Physical Activity and Cancer
Hormonal Interventions: From Adjuvant Therapy to Breast
Cancer Prevention

After Helen Atkins
Access Innovations, Inc..

12/11/2013

All data up-posted to the top level
This is a radial
graph of “plosthes”.
The number of
records for which
each index term
occurs is reflected
by circle sizes.

Access Innovations, Inc..

12/11/2013

Visualize your tagged data
Access Innovations, Inc..

12/11/2013

Semantic Hierarchy
browsable tree
Access Innovations, Inc..

12/11/2013

Related Terms
semantic web
Access Innovations, Inc..

12/11/2013

Synonyms
search foundation
Access Innovations, Inc..

12/11/2013

Load to a visualization program
such as Prefuse
• Migration patterns
• YouTube
• Citizen Science

Access Innovations, Inc..

• Drawing together information from several
sources
• Overlay on additional surfaces – like maps
• Fun distribution of data
• Cornell School of Ornithology

12/11/2013

Data Mashups
• ISI does it with references
• AIP’s UniPHY does it using semantics

12/11/2013

Scientific social networking
basedciting who like
on metadata
• Idea has been here - Who is
• good metadata and descriptors

See the authors connections
Map who is working in the
field and where

Access Innovations, Inc..

• Expand your options using
Access Innovations, Inc..

12/11/2013

Authors at a Place
• With interpretive layer – computer assisted

• Links to data
• Open data / supplemental data
• Too much is not enough!

Access Innovations, Inc..

• Document systems replaced
• New infrastructure
• Institutional repositories not good long term for Big
Data
• Need to operate at scale
• Integrated, ecosystem to infrastructure
• Replace customized human mediated

12/11/2013

Future?
• Go down hall and ask

• Big Data
• Too many people to ask
• Lose provenance
• Hard to differentiate the layers

Access Innovations, Inc..

• Not by size of collection
• By the services they offer
• More services more competitive
• Spreadsheet science

12/11/2013

Differentiation
People, data usage
Preservation and discovery
Publishers’ content, not format
Visualization of Big Data sets
Reproducibility of research
Shared knowledge
Open peer review
•
•
•
•
•

Researcher.org
Galaxy zoo
Zooniverse
900,000 citizen scientists
Wikipedia

Access Innovations, Inc..

•
•
•
•
•
•
•

12/11/2013

New ways of doing things
Access Innovations, Inc..

12/11/2013
• Weeding
• Redefining collections
• Collection development

• Skills to apply to Big Data
• Vocabulary development
• Search expertise
• Reference ability

• What to throw away
• What to keep for discoverability
• Need metadata to preserve
• Better discoverability
• Easier preservation
• Keep the data provenance

Access Innovations, Inc..

• Librarians

12/11/2013

Skills we bring
Access Innovations, Inc..

• Big Data – What is it?
• New Government Initiative
• Content organization
• Discovery (Search)
• Management
• Skills we bring
• Examples of what we can do

12/11/2013

We covered
• Semantic Enrichment Services Provider
• We change search to find!

• Creator of Data Harmony tools
• More than 600 taxonomies created
• More than 2000 engagements
• Financed by Sweat, Persistence, and Good Cash Flow
Management
• Accurate, on time, under budget!

Access Innovations, Inc..

• Access Innovations –

12/11/2013

Who are we?
• Marjorie Hlava

• mhlava@accessinn.com

• +1-505-998-0800
• Access Innovations/Data Harmony/Access Integrity
• Taxodiary.com (blog)
• Taxobank.org (reference tool for taxonomists)

12/11/2013

Thank you

Access Innovations, Inc..

Slides at
www.accessinn.com/presentations

Mais conteúdo relacionado

Mais procurados

Data catalog
Data catalogData catalog
Data catalogiamtodor
 
Introduction to Data Management Planning
Introduction to Data Management PlanningIntroduction to Data Management Planning
Introduction to Data Management PlanningSarah Jones
 
000 introduction to big data analytics 2021
000   introduction to big data analytics  2021000   introduction to big data analytics  2021
000 introduction to big data analytics 2021Dendej Sawarnkatat
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaMaria de la Iglesia
 
Documaster – The true value of documents
Documaster – The true value of documentsDocumaster – The true value of documents
Documaster – The true value of documentsNXC Switzerland
 
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...Katie Whipkey
 
Perspectives from the African Open Science Platform/Susan Veldsman
Perspectives from the African Open Science Platform/Susan VeldsmanPerspectives from the African Open Science Platform/Susan Veldsman
Perspectives from the African Open Science Platform/Susan VeldsmanAfrican Open Science Platform
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldDez Blanchfield
 
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupCrowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupEdward Curry
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification303Computing
 
challenges of big data to big data mining with their processing framework
challenges of big data to big data mining with their processing frameworkchallenges of big data to big data mining with their processing framework
challenges of big data to big data mining with their processing frameworkKamleshKumar394
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
 
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian DuncanARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian DuncanARDC
 
Learning from past infrastructure to embrace friction and create the Research...
Learning from past infrastructure to embrace friction and create the Research...Learning from past infrastructure to embrace friction and create the Research...
Learning from past infrastructure to embrace friction and create the Research...Research Data Alliance
 
Bigdata and Hadoop with applications
Bigdata and Hadoop with applicationsBigdata and Hadoop with applications
Bigdata and Hadoop with applicationsPadma Metta
 
Data science courses part 4
Data science courses   part 4Data science courses   part 4
Data science courses part 4DataMites
 

Mais procurados (19)

Data catalog
Data catalogData catalog
Data catalog
 
Introduction to Data Management Planning
Introduction to Data Management PlanningIntroduction to Data Management Planning
Introduction to Data Management Planning
 
What is Data Science
What is Data ScienceWhat is Data Science
What is Data Science
 
000 introduction to big data analytics 2021
000   introduction to big data analytics  2021000   introduction to big data analytics  2021
000 introduction to big data analytics 2021
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
 
Documaster – The true value of documents
Documaster – The true value of documentsDocumaster – The true value of documents
Documaster – The true value of documents
 
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
 
Perspectives from the African Open Science Platform/Susan Veldsman
Perspectives from the African Open Science Platform/Susan VeldsmanPerspectives from the African Open Science Platform/Susan Veldsman
Perspectives from the African Open Science Platform/Susan Veldsman
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
 
SMART Seminar Series: "From Big Data to Smart data"
SMART Seminar Series: "From Big Data to Smart data"SMART Seminar Series: "From Big Data to Smart data"
SMART Seminar Series: "From Big Data to Smart data"
 
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupCrowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification
 
challenges of big data to big data mining with their processing framework
challenges of big data to big data mining with their processing frameworkchallenges of big data to big data mining with their processing framework
challenges of big data to big data mining with their processing framework
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
 
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian DuncanARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
 
Learning from past infrastructure to embrace friction and create the Research...
Learning from past infrastructure to embrace friction and create the Research...Learning from past infrastructure to embrace friction and create the Research...
Learning from past infrastructure to embrace friction and create the Research...
 
Bigdata and Hadoop with applications
Bigdata and Hadoop with applicationsBigdata and Hadoop with applications
Bigdata and Hadoop with applications
 
Big Data Presentation
Big Data PresentationBig Data Presentation
Big Data Presentation
 
Data science courses part 4
Data science courses   part 4Data science courses   part 4
Data science courses part 4
 

Semelhante a Big Data Content Organization, Discovery, and Management

Big Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementBig Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementAccess Innovations, Inc.
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorialJosh Young
 
Data Policy for Open Science
Data Policy for Open ScienceData Policy for Open Science
Data Policy for Open ScienceMark Parsons
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementTony Bain
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media suresh sood
 
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...Lauri Eloranta
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
Infrastructure, relationships, trust, and RDA
Infrastructure, relationships, trust, and RDAInfrastructure, relationships, trust, and RDA
Infrastructure, relationships, trust, and RDAResearch Data Alliance
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypseENUG
 
RDA - The Research Data Alliance in a Nutshell
RDA - The Research Data Alliance in a NutshellRDA - The Research Data Alliance in a Nutshell
RDA - The Research Data Alliance in a NutshellResearch Data Alliance
 
A Year in Review - Building a Comprehensive Data Management Program
A Year in Review - Building a Comprehensive Data Management ProgramA Year in Review - Building a Comprehensive Data Management Program
A Year in Review - Building a Comprehensive Data Management ProgramDataWorks Summit
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdfAkuhuruf
 
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...robkitchin
 

Semelhante a Big Data Content Organization, Discovery, and Management (20)

Big Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementBig Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and Management
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial
 
Data Policy for Open Science
Data Policy for Open ScienceData Policy for Open Science
Data Policy for Open Science
 
Data Policy for Open Science
Data Policy for Open ScienceData Policy for Open Science
Data Policy for Open Science
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Big data
Big dataBig data
Big data
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data Management
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media
 
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Sample
Sample Sample
Sample
 
Infrastructure, relationships, trust, and RDA
Infrastructure, relationships, trust, and RDAInfrastructure, relationships, trust, and RDA
Infrastructure, relationships, trust, and RDA
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
RDA - The Research Data Alliance in a Nutshell
RDA - The Research Data Alliance in a NutshellRDA - The Research Data Alliance in a Nutshell
RDA - The Research Data Alliance in a Nutshell
 
M.Florence Dayana
M.Florence DayanaM.Florence Dayana
M.Florence Dayana
 
A Year in Review - Building a Comprehensive Data Management Program
A Year in Review - Building a Comprehensive Data Management ProgramA Year in Review - Building a Comprehensive Data Management Program
A Year in Review - Building a Comprehensive Data Management Program
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdf
 
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...
The Impact of the Data Revolution on Official Statistics: Opportunities, Chal...
 

Mais de Access Innovations, Inc.

Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy ResultsMaking AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy ResultsAccess Innovations, Inc.
 
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8Access Innovations, Inc.
 
Hindawi taxonomy and personalization 27.10 (1)
Hindawi taxonomy and personalization 27.10 (1)Hindawi taxonomy and personalization 27.10 (1)
Hindawi taxonomy and personalization 27.10 (1)Access Innovations, Inc.
 
Asco using ai-taxos-for meta-titles-february-2021
Asco using ai-taxos-for meta-titles-february-2021Asco using ai-taxos-for meta-titles-february-2021
Asco using ai-taxos-for meta-titles-february-2021Access Innovations, Inc.
 
Ai webinar 2 -what's in a name (consolidated pdf)
Ai webinar 2 -what's in a name (consolidated pdf)Ai webinar 2 -what's in a name (consolidated pdf)
Ai webinar 2 -what's in a name (consolidated pdf)Access Innovations, Inc.
 
Tagging overview - Why Keywords Don't Cut It
Tagging overview  - Why Keywords Don't Cut ItTagging overview  - Why Keywords Don't Cut It
Tagging overview - Why Keywords Don't Cut ItAccess Innovations, Inc.
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityAccess Innovations, Inc.
 
DHUG 2017 - Understanding ROI Just Enough to Get Your Project Funded
DHUG 2017 - Understanding ROI Just Enough to Get Your Project FundedDHUG 2017 - Understanding ROI Just Enough to Get Your Project Funded
DHUG 2017 - Understanding ROI Just Enough to Get Your Project FundedAccess Innovations, Inc.
 

Mais de Access Innovations, Inc. (20)

Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy ResultsMaking AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
 
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8
ISO 25964-1Working Group ISO/TC 46/SC 9/WG 8
 
Smart submit
Smart submitSmart submit
Smart submit
 
Plos taxonomy beyond search dhug 2021
Plos taxonomy beyond search   dhug 2021Plos taxonomy beyond search   dhug 2021
Plos taxonomy beyond search dhug 2021
 
Hindawi taxonomy and personalization 27.10 (1)
Hindawi taxonomy and personalization 27.10 (1)Hindawi taxonomy and personalization 27.10 (1)
Hindawi taxonomy and personalization 27.10 (1)
 
Data harmonycloudpowerpointclientfacing
Data harmonycloudpowerpointclientfacingData harmonycloudpowerpointclientfacing
Data harmonycloudpowerpointclientfacing
 
Data harmony update 2021
Data harmony update 2021 Data harmony update 2021
Data harmony update 2021
 
Atypon dhug2021
Atypon dhug2021Atypon dhug2021
Atypon dhug2021
 
Asco using ai-taxos-for meta-titles-february-2021
Asco using ai-taxos-for meta-titles-february-2021Asco using ai-taxos-for meta-titles-february-2021
Asco using ai-taxos-for meta-titles-february-2021
 
Asce more than just topic taxonomies
Asce more than just topic taxonomiesAsce more than just topic taxonomies
Asce more than just topic taxonomies
 
Acs discoverability-dhug2021
Acs discoverability-dhug2021Acs discoverability-dhug2021
Acs discoverability-dhug2021
 
Ai webinar 2 -what's in a name (consolidated pdf)
Ai webinar 2 -what's in a name (consolidated pdf)Ai webinar 2 -what's in a name (consolidated pdf)
Ai webinar 2 -what's in a name (consolidated pdf)
 
Tagging overview - Why Keywords Don't Cut It
Tagging overview  - Why Keywords Don't Cut ItTagging overview  - Why Keywords Don't Cut It
Tagging overview - Why Keywords Don't Cut It
 
Health Affairs - Why Keywords Don't Cut It
Health Affairs - Why Keywords Don't Cut ItHealth Affairs - Why Keywords Don't Cut It
Health Affairs - Why Keywords Don't Cut It
 
Why Keywords Don't Cut It
Why Keywords Don't Cut ItWhy Keywords Don't Cut It
Why Keywords Don't Cut It
 
Data Harmony update 2020 final
Data Harmony update 2020 finalData Harmony update 2020 final
Data Harmony update 2020 final
 
Data Harmony Update 2020 final
Data Harmony Update 2020 finalData Harmony Update 2020 final
Data Harmony Update 2020 final
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository Interoperability
 
DHUG 2018 - Florida Thesis OCR
DHUG 2018 - Florida Thesis OCRDHUG 2018 - Florida Thesis OCR
DHUG 2018 - Florida Thesis OCR
 
DHUG 2017 - Understanding ROI Just Enough to Get Your Project Funded
DHUG 2017 - Understanding ROI Just Enough to Get Your Project FundedDHUG 2017 - Understanding ROI Just Enough to Get Your Project Funded
DHUG 2017 - Understanding ROI Just Enough to Get Your Project Funded
 

Último

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Big Data Content Organization, Discovery, and Management

  • 2. Marjorie M.K. Hlava President Access Innovations, Inc. Mhlava@accessinn.com 12/11/2013 Access Innovations, Inc.. Big Data Content Organization, Discovery, and Management
  • 3. Access Innovations, Inc.. • Big Data • New Government Initiative • Content Organization • Discovery (Search) • Management • Skills we bring • Examples of what we can do 12/11/2013 Outline
  • 4. Access Innovations, Inc.. • Data is the new oil – we have to learn how to mine it! Qatar – European Commission Report • $ 7 trillion economic value in 7 US sectors alone • $90 B annually in sensitive devices • An insurance firm with 5 terabytes of data on share drives pays $1.5 m per year • New McKinsey 4th factor of production • Land, Labor, Capital, + Data 12/11/2013 Why do we care about Big Data?
  • 5. Access Innovations, Inc.. 12/11/2013 The Data Deluge – Wired 16.07
  • 6. Access Innovations, Inc.. • Google, eBay, LinkedIn, and Facebook were built around Big Data from the beginning. • No need to reconcile or integrate Big Data with more traditional sources of data and the analytics performed upon them • No merging Big Data technologies with their traditional IT infrastructures • Big Data could stand alone, Big Data analytics could be the only focus of analytics • Big Data technology architectures could be the only architecture. 12/11/2013 Big Data Born
  • 7. Access Innovations, Inc.. • Large, well-established orgs • Must be integrated with everything else that’s going on in the company. • Analytics on Big Data have to coexist with analytics on other types of data. • Hadoop clusters have to do their work alongside IBM mainframes. • Data scientists must somehow get along and work jointly with mere quantitative analysts. 12/11/2013 Integrating Big Data
  • 8. Access Innovations, Inc.. • Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set. – Wikipedia, May 2011 • “Unstructured” • Terabytes, petabytes, zettabytes • Streaming 12/11/2013 What is Big Data?
  • 10. Access Innovations, Inc.. • Volume, Velocity, Variety • Ability to deal overwhelmed • More about methods than data • Location aware data • Life streaming • Insurance claims • Hubble telescope • CERN Collections • Flight data 12/11/2013 New Special Collections
  • 12. Access Innovations, Inc.. • Means untagged or unformatted • PDF • Word files • File shares • News feeds • News Data feeds • Images 12/11/2013 Unstructured data
  • 13. Access Innovations, Inc.. • All data has some structure and more structure possibilities • PDF properties • Word file properties • File structures 12/11/2013 Bit of a misnomer
  • 14. Access Innovations, Inc.. • PDF - property tables • Word files – property tables • Files shares – implied structure in file names • News feeds – headers have metadata • Hubbell telescope feeds • Images 12/11/2013 Property tables
  • 17. 12/11/2013 • XML tagged data Access Innovations, Inc.. Structured data
  • 18. Access Innovations, Inc.. • Data infrastructure challenges • “taking diverse and heterogeneous data sets and making them more homogeneous and usable” • An opportunity? • All that data – what can it tell us? • Privacy • Copyright • Neurological impact • Data collection methods 12/11/2013 What are the problems?
  • 19. Access Innovations, Inc.. The Big Data Senior Steering Group (BDSSG) was formed to identify current Big Data research and development activities across the Federal government, offer opportunities for coordination, and identify what the goal of a national initiative in this area would look like. Subsequently, in March 2012, The White House Big Data R&D Initiative was launched and the BDSSG continues to work in four main areas to facilitate and further the goals of the Initiative. 12/11/2013 New Government Initiative
  • 20. Access Innovations, Inc.. • Fast-growing volume of digital data of digital data • Advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data. • Harness these technologies to accelerate the pace of discovery in science and engineering, strengthen our national security, and transform teaching and learning; and • Expand the workforce needed to develop and use Big Data technologies. 12/11/2013 The National Big Data R&D Initiative
  • 21. •Big Data •Data analytics; •Educate and expand the Big Data workforce •Improve key outcomes in economic growth, job creation, education, health, energy, sustainability, public safety, advanced manufacturing, science and engineering, and global development Access Innovations, Inc.. •Advance supporting technologies 12/11/2013 Data to Knowledge to Action
  • 22. Access Innovations, Inc.. • January 22, 2013 - Data on Data: Presenting Stakeholder Alignment Data on the Cyberinfrastructure for Earth System Science • Presentation and discussion with Professor Joel Cutcher-Gershenfeld. Professor CutcherGershenfeld presented information on the NSF EarthCube initiative including stakeholder survey data (approximately 850 responses). 12/11/2013 Data on Data
  • 24. • • • • • • INTERAGENCY WORKING GROUPS Cyber Security and Information Assurance High End Computing COMMUNITY OF PRACTICE (CoP) · Faster Administration of Science and Technology Education and Research COORDINATING GROUPS Human Computer Interaction and Information Management High Confidence Software and Systems Large Scale Networking Software Design and Productivity Social, Economic, and Workforce Implications of IT Access Innovations, Inc.. • • • • • 12/11/2013 Groups breakdown
  • 25. Access Innovations, Inc.. • Big Data • Cyber Physical Systems • Cyber Security and Information Assurance Research and Development • Health Information Technology Research and Development • Wireless Spectrum Research and Development • SUBGROUP • Health Information Technology Innovation and Development Environments Subgroup • TEAMS • Joint Engineering Team • Middleware and Grid Interagency Coordinating Team 12/11/2013 SENIOR STEERING GROUPS (SSGs)
  • 26. • • • • Local Cloud Remote Streaming • Undifferentiated • Unstructured • Needs organization • Type of database structure • RDBMS • Object oriented Access Innovations, Inc.. • Data on machines 12/11/2013 Content Organization
  • 27. Access Innovations, Inc.. 12/11/2013 RDBMS Connection Taxonomy term table
  • 28. Public repositories Commercial data sources Agency data sources Search engine Library catalogs Locally held documents Search engine Search engine Search engine Search engine spiders Filtered content Search engine Meta-Search Tool TAXONOMY Web portal 12/11/2013 INTERNET (public) Access Innovations, Inc.. Data feeds
  • 29. Structure unstructured data Create metadata Where to put it? Store metadata with the records • HTML header • Properties tables • XML files • Store metadata in a separate file • • • • • • Database Metadata repository Search system File structure SharePoint application Web interface Access Innovations, Inc.. • • • • 12/11/2013 Metadata options
  • 30. • Uncontrolled list • Name authority file • Synonym set / ring • Controlled vocabulary • Taxonomy • Thesaurus • Ontology Highly complex - $$$$ • Topic Map LOTS OF OVERLAP! • Semantic Network 12/11/2013 Not complex - $ Access Innovations, Inc.. Information retrieval starts with a knowledge organization system
  • 31. Synonyms Taxonomy Thesaurus INCREASING COMPLEXITY / RICHNESS Ambiguity control Synonym control Ambiguity control Ambiguity cont’l Synonym control Synonym cont’l Hierarchical rel’s Hierarchical rel’s Associative rel’s 12/11/2013 List of words Access Innovations, Inc.. Structure of controlled vocabularies
  • 32. TAXONOMY ONTOLOGY • See also (SA) • Non-Preferred Term (NP) • Used for (UF), See (S) • Scope Note (SN) • History (H) THESAURUS *a.k.a. subject term, heading, node, index term, keyword, category, descriptor, class 12/11/2013 • Main Term (MT) * • Top Term (TT) • Broader Terms (BT) • Narrower Terms (NT) • Narrower Term Instance • Related Terms (RT) Access Innovations, Inc.. Taxonomy / thesaurus
  • 34. Reference: “Taxonomies: the vital tool of information architecture”, www.tfpl.com. Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA Access Innovations, Inc.. A taxonomy aspires to be: • a correlation of the different functional, regional and (possibly) national languages used by a community of practice • a support mechanism for navigation • a support tool for search engines and knowledge maps • an authority for tagging documents and other information objects • a knowledge base in its own right 12/11/2013 Taxonomies in Context
  • 35. Access Innovations, Inc.. • Link the taxonomy and indexing • Always in sync with the industry • Keep up to date with terminology • Automatically index the old data • Filter newsfeeds • Search using the taxonomy • File using the taxonomy • Spell check using the taxonomy • Link to translation system • Catalog using the taxonomy 12/11/2013 Where to use a taxonomy
  • 36. Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA “Taxonomies are better created by professional indexers or librarians than by domain experts.” 12/11/2013 • Improves organization & structure • Facilitates navigation • Facilitates knowledge discovery • Reduces effort • Saves time Access Innovations, Inc.. Value of a Taxonomy
  • 37. • knowledge management system • search retrieval system • portal interface Access Innovations, Inc.. • A well formed taxonomy is based on a thesaurus • Provides a flexible platform for many views of the taxonomy • Allows fast deployment • Is the basis for a good 12/11/2013 Taxonomies
  • 38. Database Management Establish rules for term use System Suggest indexing terms Thesaurus tool Indexing tool Validate terms Add terms and rules Change terms and rules Delete terms and rules Access Innovations, Inc.. Search thesaurus Validate term entry Block invalid terms Record candidates 12/11/2013 A quick look behind the scenes
  • 40. Suggested taxonomy descriptors Access Innovations, Inc.. 12/11/2013
  • 41. Access Innovations, Inc.. • Create the metadata structure • Gather the locations of the records • Index in place? • Point to the data? • Add metadata to the record? • Store metadata in separate “table”? • Use in full text? • Use in search? • Link databases and data sets with APIs 12/11/2013 Workflow order
  • 42. • Was: Card catalog or OPACs to books on shelf • Now: EBSCO host to Mendeley, Zotero, ResearchGate, Academic.edu, etc. • Use an API or web call • Web calls • Call to another web platform • APIs, written handshakes • Z39.50 is one standard for libraries Access Innovations, Inc.. • Link data cross platforms • Not federated search • Examples 12/11/2013 APIs and Web Calls
  • 43. • Free text / full text • Fielded / Faceted • Parts of Search • Presentation layer • Caching • Inverted index Access Innovations, Inc.. • Search 12/11/2013 Discovery (Search)
  • 44. • Autonomy / Verity • Bayesian – • FAST • Lucene • Faceted • Endeca • Boolean • Dialog • Ranking Algorithms • Google Access Innovations, Inc.. • Keyword – 12/11/2013 Kinds of search
  • 45. Access Innovations, Inc.. • Query language • Search technology • Inverted index • Ranking algorithms • Other enhancements 12/11/2013 Parts of search
  • 46. FAST ESP and Data Harmony 12/11/2013 Architecture Core Architectural Components MANAGEMENT API Custom Applications Content Push FILTER SERVER CUSTOM CONNECTOR MAIstro Agent DB Data Harmony Governance API Query Alerts Vertical Applications Portals Results NavTree & MAI Query LOTUS NOTES CONNECTOR Email, Groupware Index DB Pipeline QUERY PROCESSOR Databases DATABASE CONNECTOR Pipeline DOCUMENT PROCESSOR FILE TRAVERSER CONTENT API Files, Documents SEARCH SERVER QUERY API WEB CRAWLER Web Content Custom Front-Ends Mobile Devices Access Innovations, Inc.. Administrator’s Dashboard
  • 47. Access Innovations, Inc.. 12/11/2013 SLA search Interpret search word “competencies” as taxonomy term Professional competencies
  • 48. 12/11/2013 Access Innovations, Inc.. Search “taxonomy” in XML descriptor field returns all documents in that category  27 Search in original metadata  1 Solution: Include descriptors with metadata!
  • 49. Access Innovations, Inc.. • Data management layers • Curation • Preservation • Archiving • Storage • Choose tools wisely • Data mashups 12/11/2013 Management
  • 50. • Sorting for production • Level of effort to get to 85% • Integration into the workflow is efficient Access Innovations, Inc.. • Reach 85% precision to launch for productivity assisted • Reach 85% for filtering or categorization 12/11/2013 Go – No Go
  • 51. Access Innovations, Inc.. • Information infrastructure • Information dynamics • Tensions on data • Design elements • How and when to set the data framework • Identifiers (taxonomies) Glue to hold data together 12/11/2013 Learn to understand the parts
  • 52. Access Innovations, Inc.. • Core cross disciplines • Combo of humans and machines • What is open • What needs a gatekeeper • Store locally • Store in cloud • Some things people do better 12/11/2013 Services
  • 53. • Satisfice = satisfaction + suffice • How good is good enough? • How much error can you put up with? • Hits, misses, noise Access Innovations, Inc.. • 15 – 20% irrelevant returns / noise • Amount of work needed to achieve 85% level • How good is good enough? 12/11/2013 Benchmarks
  • 54. • 1.5 synonyms per terms • 7500 terms total • Assume 85% accuracy • Use assisted for indexing • Use automatically for filtering • Assume $55 per hour for staff • Assume 10000 records for test batch Access Innovations, Inc.. • Assume – 5000 term thesaurus 12/11/2013 ROI Calculations
  • 55. • • • • • • • • • Preferred and non-preferred terms Find records – could be programmatic Need to collect and review about 60 to get 20 appropriate ones Review records – ensure they are correct 3 minutes average per final selected record = 1 hour per term entry 1 minute to review a record (20 resulting records) 7500 terms at one hour per term = 7500 hours Estimated at 7500 hours $55 / hour = $412,500 Access Innovations, Inc.. • Collect 20 items per thesaurus entry 12/11/2013 Co-occurrence - Training sets
  • 56. • • • • Average rule takes 5 minutes (Actually MUCH faster using a rule building tool) 5 x 1500 = 7500 minutes 125 hours x $55 = $6875 Access Innovations, Inc.. • 80% of rules built automatically • 7500 x .8 = 6000 • 20% require complex rules 12/11/2013 Rulebuilding - rules
  • 57. Access Innovations, Inc.. • Cost of auto system • Cost of getting system ready • Ongoing maintenance • Increased efficiency • Increased quality of retrieval • Cost of legacy system maintenance 12/11/2013 ROI - Segments
  • 58. • Cost of auto system- $500,000 (could be less, but the one used for this study cost this amount) • Cost of getting system ready 12/11/2013 ROI – Co-occurrence • Estimated at 2 weeks programming $100 / hour = $8000 • Training sets • Estimated at 7500 hours $55 / hour = $412,500 • Possible need to re-run training set several times • Ongoing maintenance • Estimated at 15% of purchase price for license = $75,000 • Quarterly retraining to keep up with new terms – • Training sets for new terms 50 terms per quarter = 200 x $55=$11,000 • • • • Increased efficiency Expected accuracy at 60% Increased quality of retrieval Cost of legacy system maintenance Access Innovations, Inc.. • Programming support and integration
  • 59. • Cost of auto system- $60,000 • Cost of getting system ready 12/11/2013 ROI – Rules system • Programming support and integration • Rule building • Estimated at 125 hours at $55 / hour = $6850 • Possible need to re run training set several times • Ongoing maintenance • Estimated at 15% of purchase price for license = $9000 • Rule building for new terms, 50 terms per quarter • 200 terms x .8 = 160 automatic • 40 at 5 minutes per term = 200 minutes /60 = 3.33 hours x $55 = $183 • • • • Increased efficiency Expected accuracy at 60% Increased quality of retrieval Cost of legacy system maintenance Access Innovations, Inc.. • Estimated at 2 weeks programming at $100 / hour = $8000
  • 60. • Year one 12/11/2013 ROI - Rules • Years thereafter • $9000 + $183 = $9183 • 85% accuracy Access Innovations, Inc.. • $60,000 + $8,000 + $6,850 = $74,850
  • 61. • Year one 12/11/2013 ROI – Co-occurrence • Years thereafter • $75000 + $11,000 = $86,000 • 60% accuracy Access Innovations, Inc.. • $500,000 + $8,000 + $412,500 = $920,500
  • 62. • Controlled – use rules • New terms - use inference Access Innovations, Inc.. • Novelty detection requires inference techniques • Is needed to discover new terms • Controlled vocabulary / thesaurus can be more controlled and accurate approach • Combination of the two would be best 12/11/2013 Summary on ROI
  • 63. Client Data Full Text Tag and Create metadata Put in data base with tags Build Search inverted index Automatic Summarization Search Presentation Layer HTML, PDF, Data Feeds, etc. Machine Aided Indexer (M.A.I.™) Inline Tagging Client Client Taxonomy taxonomy Metadata and Entity Extractor Thesaurus Master Create user interface Access Innovations, Inc.. Gather source data 12/11/2013 Fully integrated with MOSS The Workflow Database Repository Search Software Increases accuracy Browse by Subject Auto-completion Broader Terms Narrower Terms Related Terms
  • 64. Access Innovations, Inc.. 12/11/2013 Inline Tagging Shows the exact point where the concept is mentioned Mouse-over to view the term record Statistical summary, showing the number of times each term is mentioned in the article
  • 65. Automatic Summarization Full text, HTML, PDF, data feeds Inline Tagging Metadata Extractor Apply terms Rules Base Thesaurus Master Client Taxonomy Data Repository Search Software User Interface Web Portal Access Innovations, Inc.. 12/11/2013 Semantic Process
  • 66. Raw Full text data feeds Printed source materials Data Crawls on data sources Source data XIS Creation XIS repository Load to Search Taxonomy terms MAI Concept Extractor MAI Rule Base Taxonomy Thesaurus Master Add metadata Clean and enhance data Search Harmony Display Search Search data Access Innovations, Inc.. SQL for ecommerce 12/11/2013 Database Plus Search Workflow
  • 67. Creating an Inverted File Index 1 Define key terminology 2 Thesaurus tools • Features • Functions 3 Costs • Thesaurus construction • Thesaurus tools 4 Why & when? Access Innovations, Inc.. Outline of Presentation 12/11/2013 Sample DOCUMENT
  • 69. & - Stop 1 - Stop 2 - Stop 3 - Stop 4 - Stop construction - L7, P2, SH costs - L6, P1, H define - L2, P1, H features - L4, P1, SH functions - L5, P1, SH 12/11/2013 key - L2, P2, H of - Stop outline - L1, P1, T presentation - L1, P3, T terminology - L2, P3, H thesaurus - (1) - L3, P1, H (2) - L7, P1, SH (3) - L8, P1, SH tools - (1) - L3, P2, H (2) - L8, P2, SH when - L9, P3, H why - L9, P1, H Access Innovations, Inc.. Complex inverted file index Placement location
  • 70. 12/11/2013 Complex Search Farm Query Search Harmony Presentation Layer Cleanup, etc. Federators Deploy Hub Repository XIS (cache) Cache Builders Source Data Access Innovations, Inc.. Query Servers Index Builders
  • 71. • That is what librarians usually look at. • What are the options coming? • How can we encourage useful changes? 12/11/2013 Access Innovations, Inc.. What happens at the search presentation layer?
  • 75. Journal Article on Topic A Grant Available for Researchers Working on Topic A Author Networks Social Networking Based on an image by Helen Atkins Upcoming Conference on Topic A Job Posting for Expert on Topic A Podcast Interview with Researcher Working on Topic A Access Innovations, Inc.. Other Journal Articles on Topic A CME Activity on Topic A 12/11/2013 Data linked and served by metadata
  • 76. Alcohol, Folate, Methionine, and Risk of Incident Breast Cancer in the American Cancer Society Cancer Prevention Study II Nutrition Cohort Heather Spencer Feigelson1, Carolyn R. Jonas, Andreas S. Robertson, Marjorie L. McCullough, Michael J. Thun and Eugenia E. Calle Department of Epidemiology and Surveillance Research, American Cancer Society, National Home Office, Atlanta, Georgia 30329-4251 Recent studies suggest that the increased risk of breast cancer associated with alcohol consumption may be reduced by adequate folate intake. We examined this question among 66,561 postmenopausal women in the American Cancer Society Cancer Prevention Study II Nutrition Cohort. Think Tank Report Related Working Groups •Finance Related Think Tank Report Content •Charter •Molecular Epidemiology Webcasts Related Awards Related Webcasts •AACR-GlaxoSmithKline Clinical Cancer Research Scholar Awards •ACS Award •Weinstein Distinguished Lecture Related Press Releases 12/11/2013 •How What and How Much We Eat (And Drink) Affects Our Risk of Cancer •Novel COX-2 Combination Treatment May Reduce Colon Cancer Risk Combination Regimen of COX-2 Inhibitor and Fish Oil Causes Cell Death •COX-2 Levels Are Elevated in Smokers Related AACR Workshops and Conferences •Frontiers in Cancer Prevention Research •Continuing Medical Education (CME) •Molecular Targets and Cancer Therapeutics Related Meeting Abstracts Access Innovations, Inc.. Cancer Epidemiology Biomarkers & Prevention Vol. 12, 161-164, February 2003 © 2003 American Association for Cancer Research Short Communications •Association between dietary folate intake, alcohol intake, and methylenetetrahydrofolate reductase C677T and A1298C polymorphisms and subsequent breast •Folate, folate cofactor, and alcohol intakes and risk for colorectal adenoma •Dietary folate intake and risk of prostate cancer in a large prospective cohort study Related Education Book Content Oral Contraceptives, Postmenopausal Hormones, and Breast Cancer Physical Activity and Cancer Hormonal Interventions: From Adjuvant Therapy to Breast Cancer Prevention After Helen Atkins
  • 77. Access Innovations, Inc.. 12/11/2013 All data up-posted to the top level
  • 78. This is a radial graph of “plosthes”. The number of records for which each index term occurs is reflected by circle sizes. Access Innovations, Inc.. 12/11/2013 Visualize your tagged data
  • 82. Access Innovations, Inc.. 12/11/2013 Load to a visualization program such as Prefuse
  • 83. • Migration patterns • YouTube • Citizen Science Access Innovations, Inc.. • Drawing together information from several sources • Overlay on additional surfaces – like maps • Fun distribution of data • Cornell School of Ornithology 12/11/2013 Data Mashups
  • 84. • ISI does it with references • AIP’s UniPHY does it using semantics 12/11/2013 Scientific social networking basedciting who like on metadata • Idea has been here - Who is • good metadata and descriptors See the authors connections Map who is working in the field and where Access Innovations, Inc.. • Expand your options using
  • 86. • With interpretive layer – computer assisted • Links to data • Open data / supplemental data • Too much is not enough! Access Innovations, Inc.. • Document systems replaced • New infrastructure • Institutional repositories not good long term for Big Data • Need to operate at scale • Integrated, ecosystem to infrastructure • Replace customized human mediated 12/11/2013 Future?
  • 87. • Go down hall and ask • Big Data • Too many people to ask • Lose provenance • Hard to differentiate the layers Access Innovations, Inc.. • Not by size of collection • By the services they offer • More services more competitive • Spreadsheet science 12/11/2013 Differentiation
  • 88. People, data usage Preservation and discovery Publishers’ content, not format Visualization of Big Data sets Reproducibility of research Shared knowledge Open peer review • • • • • Researcher.org Galaxy zoo Zooniverse 900,000 citizen scientists Wikipedia Access Innovations, Inc.. • • • • • • • 12/11/2013 New ways of doing things
  • 90. • Weeding • Redefining collections • Collection development • Skills to apply to Big Data • Vocabulary development • Search expertise • Reference ability • What to throw away • What to keep for discoverability • Need metadata to preserve • Better discoverability • Easier preservation • Keep the data provenance Access Innovations, Inc.. • Librarians 12/11/2013 Skills we bring
  • 91. Access Innovations, Inc.. • Big Data – What is it? • New Government Initiative • Content organization • Discovery (Search) • Management • Skills we bring • Examples of what we can do 12/11/2013 We covered
  • 92. • Semantic Enrichment Services Provider • We change search to find! • Creator of Data Harmony tools • More than 600 taxonomies created • More than 2000 engagements • Financed by Sweat, Persistence, and Good Cash Flow Management • Accurate, on time, under budget! Access Innovations, Inc.. • Access Innovations – 12/11/2013 Who are we?
  • 93. • Marjorie Hlava • mhlava@accessinn.com • +1-505-998-0800 • Access Innovations/Data Harmony/Access Integrity • Taxodiary.com (blog) • Taxobank.org (reference tool for taxonomists) 12/11/2013 Thank you Access Innovations, Inc.. Slides at www.accessinn.com/presentations

Notas do Editor

  1. Thanks to Helen Atkins of AACR for this illustration.The real power of this is that the links can all go in all directions, so we take advantage of having the user’s attention regardless of how they step into our “web”