From the MER Conference 2012
Speakers: Jason R. Baron, Esq., and Dave Lewis, Ph.D.
2012 is the year we will see great strides by information professionals in using automation (in the form of "predictive" and "technology-assisted" search, filtering, and auto-classification) for the purpose of achieving efficiencies and cutting costs in records management as well as in legal settings.
The strategic use of these new methods is absolutely necessary given the massive, exponential increases in electronically stored information - in the form of records within corporate networks and repositories.
This session addresses the latest technological developments from two perspectives:
- A longtime advocate of smart technology in the public recordkeeping sector, and
- A leading information scientist.
The session includes a state-of-the-art overview of the latest developments in technology-assisted review, with an emphasis on how these technologies can and will enhance electronic records management by helping to end the era of excessive reliance on end-user RM.
You will learn:
- What technology-assisted review and predictive analytics are all about: using advanced search, filtering, and auto-classification as part of a defensible electronic records management program.
- How these technologies also add value to overall corporate information governance.
M12S06 - Will Technology-Assisted Predictive Modeling and Auto-Classification End the 'End User' Burden in Records Management?
1. Cohasset Associates, Inc.
Will Technology-Assisted Predictive Modeling and Auto-Classification
End the ‘End-User’ Burden in Records Management?
2012 Managing Electronic Records Conference
Chicago, IL
May 7, 2012
Jason R. Baron, Esq.
Director of Litigation
Office of General Counsel
National Archives and Records Administration
Dave Lewis, Ph.D.
David D. Lewis Consulting, LLC
Chicago, IL
A New Era of Government
“[P]roper records management is the backbone of open Government.”
President Obama’s Memorandum dated November 28, 2011
re “Managing Government Records”
http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records
Reality:
The era of Big Data has just
begun….
Lehman Brothers Investigation
-- 350 billion page universe (3 petabytes)
-- Examiner narrowed collection by selecting
key custodians, using dozens of Boolean
searches
-- Reviewed 5 million docs (40 million pages)
using 70 contract attorneys
Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11
Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at
http://lehmanreport.jenner.com/.
Process Optimization Problem 1: The
transactional toll of user-based
recordkeeping schemes (“as is” RM)
…. and the need for
better, automated solutions ….
Impact of Technology on E-Records
Management: Snapshot 2012 (“As is”)
A universe of proprietary products exists in the
marketplace: document management and
records management applications (RMAs)
DoD 5015.2 version 3 compliant products
However, scalability issues exist
Agencies must prepare to confront significant
front-end process issues when transitioning to
electronic recordkeeping
Records schedule simplification is key
RM wish list for 2012….
RM’s “easy button”: the elusive goal of zero
extra keystrokes to comply with RM
requirements (capture)
A technology app that automatically tags
records in compliance with RM policies and
practices (categorize)
Supervised learning RM with minimal records
officer or end user involvement (learn)
Rule-based and role-based RM
Advanced search
Electronic Archiving As The
First Step
What is it?
100% snapshot of (typically) email, plus in some
cases other selected ESI applications
How does it differ from an RMA?
Goal is preservation of evidence, not records
management per se
NARA Bulletin 2008-05
A Possible Path Forward?
Email archiving in short term, synced to existing
proprietary software on email system
Designation of key senior officials as creating
permanent records, consistent with existing records
schedules
Additional designations of permanent records by
agency component
“Smart” filters/categorical rules built in based on
content, to the extent feasible to do
By default, records fall into designated temporary record
buckets and are disposed of under existing records
schedules.
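The designations and default described on this slide could be sketched as a small rule set. This is only an illustration: the role names, the keyword rules, and the assign_bucket function are hypothetical and do not come from any agency schedule or product.

```python
# Hypothetical role-based designations: officials whose email is treated as permanent.
PERMANENT_ROLES = {"agency head", "deputy secretary", "general counsel"}

# Hypothetical content-based ("smart filter") rules for permanent records.
PERMANENT_KEYWORDS = ("policy decision", "congressional testimony")


def assign_bucket(sender_role, subject, body):
    """Return 'permanent' or 'temporary' for one email message."""
    # Rule 1: role-based designation of key senior officials.
    if sender_role.lower() in PERMANENT_ROLES:
        return "permanent"
    # Rule 2: content-based categorical rules, to the extent feasible.
    text = f"{subject} {body}".lower()
    if any(keyword in text for keyword in PERMANENT_KEYWORDS):
        return "permanent"
    # Default: temporary bucket, disposed of under the existing schedule.
    return "temporary"


print(assign_bucket("General Counsel", "Lunch", "Noon?"))
print(assign_bucket("analyst", "Draft congressional testimony", "See attached"))
print(assign_bucket("analyst", "Lunch", "Noon?"))
```

The ordering matters in this sketch: role-based rules run first because they need no content analysis, and anything not matched falls through to the temporary default.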
A pyramid approach combines disposition policy with automated
tools to bring FRA email under records
management, preservation, and access
[Figure: pyramid with a slider marking the “set-point”. Legend: permanent or top officials; temporary or staff and support.]
The position of the “set-point” for email capture depends on policy and resources:
setting it higher allows use of tools now available to get 100% of email at lower
volumes;* setting it lower means more records will be captured and smarter tools
are needed to distinguish and disposition temporary records and non-records.
Implementing an email archiving policy is feasible now, since tools are readily
available to capture 100% of email traffic at the individual or organizational level, in
formats that can be archived.
How To Avoid A Train Wreck
With Email Archiving….
Capture E-mail But Utilize Records Management!
Functional Requirements for
Categorization Products in the Federal
workplace
Ease of use
Scalability
Archiving in native formats
Metadata preservation
Seamless integration with existing software apps
Versioning
Compatibility with big bucket records schedules
Advanced search capabilities
Ease of training / machine learning using records officers or end users
Cost
Process Optimization Problem 2: The
Coming Age of Dark Archives (and the
inability to provide access)
Emerging New Strategies:
“Predictive Analytics”
Improved review and case
assessment: cluster docs
through use of software with
minimal human
intervention at front end to
code “seeded” data set
Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.
Language Processing Technologies
[Figure labels: Information Retrieval; Natural Language Processing]
1. Classification
2. Retrieval / Search
Question Answering
Summarization
Entity Recognition
Information Extraction
Machine Translation
Text Classification
Deciding which of
several groups a text
belongs to
Crudest form of
language
understanding...
...but often can be automated
with high accuracy
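As a rough illustration of automated text classification, the sketch below learns from a few manually labeled messages and then assigns a label to new text. It uses a small naive Bayes model as one common technique; the talk does not name a specific algorithm, and the training messages and the labels "record" and "non-record" are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """Learn per-class word counts from (text, label) pairs assigned by a person."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest naive Bayes log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing so an unseen word/class pair is not zero.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples, labeled by hand.
training = [
    ("signed contract amendment for vendor services", "record"),
    ("final budget approval memo for fiscal year", "record"),
    ("lunch on friday anyone", "non-record"),
    ("reminder parking garage closed friday", "non-record"),
]
model = train(training)
print(classify("revised budget memo", model))
print(classify("lunch friday", model))
```

A real deployment would need far more training data and a held-out sample to measure accuracy; four messages are only enough to show the mechanics.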
Why Classify?
Reduce infinite variety of text...
...to finite set of classes...
...to specify an action for every possible input.
Other Advantages of Text
Classification
Supervised learning: classifiers (rules) can be learned by imitating manual classifications
Straightforward numerical measures of quality (e.g., recall: 85% +/- 4%; precision: 75% +/- 3%)
Objective reason why a decision was made (the classification rule)
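The recall and precision figures on this slide can be computed by comparing a classifier's output against manual classifications of the same sample. The sketch below is a minimal version; the sample labels are invented, and the margin of error uses a simple normal approximation at 95% confidence, which is an assumption rather than the method behind the slide's numbers.

```python
import math


def precision_recall(actual, predicted, positive="record"):
    """Compare predicted labels with manual labels for one class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    # Precision: of the items the classifier flagged, how many were right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the items that should have been flagged, how many were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def margin_of_error(proportion, sample_size, z=1.96):
    """Half-width of a normal-approximation confidence interval (z=1.96 for 95%)."""
    if sample_size == 0:
        return 0.0
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Invented sample: manual labels versus classifier output.
actual = ["record"] * 4 + ["non-record"] * 4
predicted = ["record", "record", "record", "non-record",
             "record", "non-record", "non-record", "non-record"]

precision, recall = precision_recall(actual, predicted)
print(f"precision: {precision:.0%}, recall: {recall:.0%}")
print(f"margin of error for p=0.5, n=100: {margin_of_error(0.5, 100):.3f}")
```

The margin shrinks with the square root of the sample size, which is why a tight "+/-" figure requires a reasonably large manually reviewed sample.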
Variations on Classification
Binary vs. multiclass
Hierarchical
Probabilistic (e.g., 83% / 17%)
Graded / ordered / fuzzy
Defining Sets of Classes
Tradeoff among:
Ideal classes to implement policy
Classes you can teach people to assign
Classes you can teach software to assign
Be skeptical of automatic discovery of classes
Text Retrieval Systems
AKA search engines, semi-structured databases, text databases, etc.
Classification    Search
autonomous        interactive
long term         transitory
organizational    personal
structured        independent
Some Distinctions Among
Search Approaches
Exact Match vs. Ranked Retrieval vs. Browsing
"Concepts" vs. "Keywords"
Text Representations
Matching Aids
Exact Match Search
Query specifies conditions document must meet
Example: budget AND Knoxville AND (revised OR preliminary)
Variants
Boolean
SQL
Faceted
Often (ambiguously) called "keyword" search
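The example query on this slide can be sketched as a yes/no test applied to each document. This is a minimal illustration: the sample documents are invented, and matching is on whole lowercase words split on spaces, which is cruder than a real search engine's indexing.

```python
def words_in(text):
    """Lowercase set of whitespace-separated words in a document."""
    return set(text.lower().split())


def matches(doc_text):
    """Exact-match test for: budget AND Knoxville AND (revised OR preliminary)."""
    words = words_in(doc_text)
    return (
        "budget" in words
        and "knoxville" in words
        and ("revised" in words or "preliminary" in words)
    )


# Invented sample documents.
docs = [
    "Revised budget for the Knoxville office",
    "Preliminary Knoxville budget figures",
    "Final budget for the Knoxville office",
    "Revised budget for the Memphis office",
]

hits = [doc for doc in docs if matches(doc)]
for doc in hits:
    print(doc)
```

A document either meets every condition or is excluded entirely; there is no notion of a partial or weaker match, which is the main contrast with ranked retrieval.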
A Faceted Search Interface
Ranked Retrieval
Query specifies important
attributes of desired
documents
System statistically weights
those attributes
Results returned in order of
strength of match
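The three steps on this slide can be sketched with a simple TF-IDF style score, where rarer query words count for more. TF-IDF is one common weighting scheme chosen here for illustration; the slide does not specify one, and the sample documents and the smoothing formula are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, docs):
    """Return (document, score) pairs ordered from strongest to weakest match."""
    doc_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)

    def idf(term):
        # Rarer terms get a larger weight; the +1 terms are smoothing.
        doc_freq = sum(1 for counts in doc_counts if term in counts)
        return math.log((n_docs + 1) / (doc_freq + 1)) + 1

    scored = []
    for index, counts in enumerate(doc_counts):
        score = sum(counts[term] * idf(term) for term in tokenize(query))
        scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(docs[index], score) for score, index in scored]


# Invented sample documents.
docs = [
    "revised budget for the knoxville office",
    "budget meeting agenda",
    "knoxville office holiday party",
    "parking garage closed friday",
]

ranked = rank("knoxville budget revised", docs)
for doc, score in ranked:
    print(f"{score:6.3f}  {doc}")
```

Unlike the exact-match approach, documents that contain only some of the query words still appear in the results, just lower in the ordering.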
Statistical Evidence in Ranked
Retrieval
Corpus statistics
Word (and metadata) counts
Unsupervised learning
Clustering, LSI/LSA, etc.
finds (maybe useless) patterns
Supervised learning
aka "relevance feedback"
learn indicators of user interest
Browsing
Hierarchies
Networks
Clusters
Spaces / Maps / Dimensions
make great pictures / demos
unclear if useful for finding information
Visual Analysis Examples
(Presentation by Dr. Victoria Lemieux, Univ. British Columbia,
at Society of American Archivists Annual Mtg. 2010, Washington, D.C.)
With acknowledgments to Jeffrey Heer, Exploring Enron, http://hci.stanford.edu/jheer/projects/enron/,
Adam Perer, Contrasting Portraits, http://hcil.cs.umd.edu/trs/2006-08/2006-08.pdf,
and Fernanda Viegas, Email Conversations, http://fernandaviegas.com/email.html
What Evidence Can The
Search Software Use?
Words, phrases, etc.
Manually assigned categories
Metadata
Author, organization, creation date, change date, access date, length, file type, ...
Contextual information (links, attachments, ...)
What Resources Aid
Matching?
Linguistic analysis
At word level or higher
Clusters / spaces / ...
Thesauri / semantic nets /
concept maps / ...
Suited to your task?
Modifiable?
How is text determined to
belong to category?
Concepts v. Keywords
Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009
Search software marketing:
Them = keyword search = bad
Us = concept search = good
Reality:
Both terms have referred to dozens of
different technologies...
...including some of the same ones!
Conceptual search is an aspiration, not
a technology
Example of Boolean search string
from U.S. v. Philip Morris
(((master settlement agreement OR msa) AND NOT (medical
savings account OR metropolitan standard area)) OR s. 1415
OR (ets AND NOT educational testing service) OR (liggett
AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi
AND NOT presidential management intern) OR pm usa OR
rjr OR (b&w AND NOT photo*) OR phillip morris OR batco
OR ftc test method OR star scientific OR vector group OR
joe camel OR (marlboro AND NOT upper marlboro)) AND
NOT (tobacco* OR cigarette* OR smoking OR tar OR
nicotine OR smokeless OR synar amendment OR philip
morris OR r.j. reynolds OR ("brown and williamson") OR
("brown & williamson") OR bat industries OR liggett group)
U.S. v. Philip Morris E-mail Winnowing
Process
20 million email records
200,000 hits based on keyword terms used (1%)
100,000 relevant emails
80,000 produced to opposing party
20,000 placed on privilege logs
A PROBLEM: only a handful entered as exhibits at trial
A BIGGER PROBLEM: the 1% figure does not scale
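The arithmetic behind the scaling concern can be worked through directly. The stage counts below are the ones on the slide; the one-billion-email collection is a hypothetical figure chosen only to show what a constant 1% hit rate would imply, and projected_hits is an illustrative helper.

```python
# Stage counts as reported on the slide.
stages = [
    ("email records", 20_000_000),
    ("keyword hits", 200_000),
    ("relevant emails", 100_000),
    ("produced to opposing party", 80_000),
    ("placed on privilege logs", 20_000),
]

total = stages[0][1]
for name, count in stages:
    print(f"{name:28s} {count:>12,d}  ({count / total:.2%} of collection)")


def projected_hits(collection_size, hits=200_000, searched=20_000_000):
    """Keyword hits expected if the same hit rate held on a larger collection."""
    return collection_size * hits // searched


# Hypothetical collection of one billion emails (not a figure from the talk).
print(f"{projected_hits(1_000_000_000):,d} hits to review at the same 1% rate")
```

At the same hit rate, the review set for the hypothetical collection would be half the size of the entire original 20 million message collection, which is the sense in which the 1% figure does not scale.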
Judicial endorsement of predictive analytics
in document review by Judge Peck in Da
Silva Moore v. Publicis Groupe (S.D.N.Y. Feb.
24, 2012)
This opinion appears to be the first in which a Court
has approved of the use of computer-assisted review.
. . . What the Bar should take away from this Opinion
is that computer-assisted review is an available tool
and should be seriously considered for use in large-
data-volume cases where it may save the producing
party (or both parties) significant amounts of legal
fees in document review. Counsel no longer have to
worry about being the ‘first’ or ‘guinea pig’ for judicial
acceptance of computer-assisted review . . .
Computer-assisted review can now be considered
judicially-approved for use in appropriate cases.
Social Networking/Links Analysis Example
From Marc Smith
Posted on Flickr
Under Creative Commons License
Judicial second guessing of failure to use
e-search capabilities: Capitol Records v.
MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009)
“In [a prior case] the Court notes its dismay that the
party opposing discovery of its ESI had organized its
files in a manner which seemed to serve no purpose
other than ‘to discourage audits. . .’ Similarly, in this
case, [the party] host[ed] no ediscovery software on
their servers and apparently are unable to conduct
centralized email searches of groups of users
without downloading them to a separate file and
relying on the services of an outside vendor.”
Judicial second guessing of failure to use
e-search capabilities: Capitol Records v.
MP3 Tunes (cont’d)
Court went on to add:
“The day undoubtedly will come when
burden arguments based on a large
organization’s lack of internal ediscovery
software will be received about as well as the
contention that a party should be spared from
retrieving paper documents because it had
filed them sequentially, but in no apparent
groupings, in an effort to avoid the added
expense of file folders or indices.”
Problem 3: Innovative
Thinking
The records management world of
tomorrow….
References
Background Law Review Referencing Autocategorization &
Advanced Search
J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on
‘Information Inflation’ and Current Issues in E-Discovery
Search,” 17 Richmond J. Law & Technology (2011), see
http://law.richmond.edu
Latest “Predictive Coding” Case Law to follow in blogs online:
Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279
(S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012)
Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711
(N.D. Ill.) (Nolan, M.J.)
Jason R. Baron
Director of Litigation
Office of General Counsel
National Archives and
Records Administration
(301) 837-1499
Email: jason.baron@nara.gov
Dave Lewis, Ph.D.
David D. Lewis Consulting, LLC
Chicago, IL
Email: consult@DavidDLewis.com
http://www.DavidDLewis.com