In this report, I have estimated the market opportunity that the Google Book database could represent. My review is organized around which customers are likely to purchase the product, how much and to what degree they will purchase, and how Google might go about selling and marketing the product. I have excerpted the management summary section on my blog, and the full report is available in PDF here.
Readers interested in discussing this report in more detail should contact me to arrange a meeting. Contact details are included in the document.
A Database of Riches: Measuring the Options for Google’s Book Settlement Roll Out
Michael Cairns – Managing Partner, Information Media Partners
michael.cairns@infomediapartners.com
Tel: 908 938 4889
Author:
Michael Cairns has been a publishing executive and consultant for over 25
years. As President of R.R. Bowker, he led the team that transitioned the
company from a print-based organization to one reliant on web subscription
products, and also successfully broadened the company’s revenue base. During
his tenure at Bowker, he managed the sale of Bowker from Reed Elsevier and,
once that transaction was completed, he executed a strategic plan resulting in the
acquisition and integration of five companies in three years. As a consultant, he has
managed projects for many large media companies including Thomson Learning (Cengage),
Simon & Schuster, Reed Elsevier, The Interpublic Group of Companies, Ogilvy & Mather,
Hearst, Gruner + Jahr, Online Computer Library Center (OCLC), AARP and others. In
addition, Michael has held executive positions at PricewaterhouseCoopers, Berlitz
International, Inc., Macmillan, Inc., and MyWire.com.
In his current role at Information Media Partners, Michael consults with a wide spectrum of
publishing and media companies helping them define market opportunities, develop
business strategies, identify acquisition opportunities and manage through crisis. Potential
clients are encouraged to contact Michael for more information (tel: 908 938 4889).
Notes on this Report:
In the summer of 2009, I started to wonder about the potential market opportunity that the
Google Book Settlement could represent. Fellow industry consultant Mike Shatzkin and I
began to discuss the agreement and I agreed to pull together a spreadsheet that could
represent an ‘order of magnitude’ estimate of the market opportunity. This report does not
rely on any direct interviews with Google nor representatives of the Book Rights Registry
(BRR) and, as such, it only represents a structured approach to analyzing the opportunity.
Nor is this report a definitive declaration of pricing, market penetration or the manner in
which this market opportunity may be leveraged.
In addition to this report on market opportunity, I also constructed an estimate of the
potential size of the orphan works population. This material has been available for some
time on my blog (personanondata) and in several presentations I have made. I have
included this analysis as an attachment to this report. (Other than a few minor punctuation
edits, there have been no changes to my original).
Several people helped in the review of this document and, for their time and effort, I am
especially grateful. A special thanks to Mike Shatzkin of The Idea Logical Company who
originally prompted me to look at the market potential of the Google Book Settlement and
helped me organize my thoughts.
Both OCLC’s WorldCat and Bowker’s Books In Print were invaluable in developing some of
the conclusions formulated in this document. Specific citations are noted where applicable.
Readers of this report may be interested in discussing the findings with me directly and in
more detail. Please contact me to arrange a time: michael.cairns@infomediapartners.com
or 908 938 4889. Find me on LinkedIn, Twitter and Scribd.
Copyright: Michael Cairns – Replication and Distribution By Permission
Introduction:
Almost five years ago, Google embarked on the most ambitious library development project
ever conceived: To create a “Noah’s Ark” of every book ever published and to start by
digitizing books held by a rarefied group of five major academic libraries. The immediate
response from US publishers was muted, until the implications of the project became clear:
That Google proposed no boundaries to the digitization effort and initiated the scanning of
books both in and out of copyright and in and out of print. Adding to publishers’ concerns,
Google planned to display “snippets” (small selections) of the book’s content in search
results. Despite some hurried conversations among publishers, author groups and Google,
Google remained convinced that what they were doing represented a social ‘good’ and the
partial display of the scanned books was legally within the boundaries of fair use.
From the publisher perspective, this was a make-or-break moment, and the implications
were more acutely felt by trade publishers who saw the potential for their business models
to be obliterated by easy and ready access to high-quality content via a Google search over
which they would exert little or no control. Even worse was the fear that rampant piracy of
content would also develop – a debated and contentious point - given the easy access to a
digitized version of a work that could be e-mailed or printed at will. The publishers
determined that if Google were to ‘get away with it’ without challenge, then anyone would
be able to digitize publisher content and possibly replicate what has been going on in the
music and motion picture industries for almost ten years. In mid-2005, prompted by a
lawsuit filed by The Authors Guild, the Association of American Publishers (AAP), led by four
primary publishers, filed suit against Google in an effort to halt the scanning of in-copyright
materials. (The Authors Guild and AAP ultimately combined their filings).
The initial Google Book Settlement (GBS) agreement, given preliminary approval by a court
in October 2008, generated a vast amount of argument both in support of the agreement
and in challenges to it. A revised agreement was drafted after the Federal District Court of
Southern New York and Judge Chin agreed to delay the adjudication and final arguments
which were heard in late February 2010. To date, Judge Chin has given neither a timetable
nor an indication of when or how he will decide the case.
From the perspective of the early leading library participants, Google’s arrival and promise
to digitize their purposefully conserved print collections looked like a miracle. Faced with
forced declines in the dollars spent on monographs and the ever-rising expense of
maintaining over 100 years of print archives, the Google digitization program provided a
possible solution to many problems. All libraries believe they hold a social covenant to
collect, maintain and preserve the most relevant materials of interest to their communities
but maintaining that covenant becomes a challenge in an environment of increasing
expenses while also enduring the challenges of migrating to an on-line world.1
1 It is important to acknowledge that, initially, the GBS may have been seen as a solution to libraries’ conservation and preservation
needs; however, libraries have subsequently determined that they need to develop their own preservation options, an area in which The
Hathi Trust is a clear leader.
The library world is typically segmented into public and academic institutions and while
these often varied ‘communities’ may differ in their philosophy towards, for example,
collection development or preservation, they do share some common practices. Most
importantly, all libraries are committed to resource sharing and while materials use has
historically and primarily been ‘local’ to the library, every institution wants to make its
collections available to virtually any patron and institution who requests them. In short,
these library collections were always ‘accessible’ to all regardless of geography or copyright:
first US Mail and FedEx, then e-mail and the Internet, progressively made this sharing easier
but, until Google arrived with their digitization program, any sharing beyond the local
institution was via physical distribution2. In effect, it could be argued that the Google
scanning program simply makes an existing practice vastly more efficient.
Even though the approval of the Google Book Settlement (GBS) hangs in the balance under
review by Judge Chin of the Federal District Court of Southern New York, an Executive
Director has been named to head the Book Rights Registry (BRR)3 and is preparing the
groundwork to establish the organization in advance of approval. This report
represents an attempt to analyze the market size opportunity for Google as it seeks to
exploit the Google Book Settlement. Following are our summary findings which are
discussed in more detail in the ensuing pages of this report.
Summary Findings of the Report:
• Libraries will see tremendous advantages – both immediate and over time – from
the GBS, although concerns have been voiced (notably from Robert Darnton of
Harvard4)
• Google’s annual subscription revenue for licensing to libraries could approach
$260mm by year three of launch
• Over time, publishers (and content owners) will recognize the GBS service as an
effective way to reach the library community and are likely to add titles to the
service5
• Google will add services and may open the platform for other application
providers to enhance and broaden the user experience
2 Resource sharing and improvements in the ‘logistics’ provided by OCLC (WorldCat) or via consortia such as OhioLink have made
physical distribution effective and comparatively efficient.
3 The BRR is the management body tasked with administering the GBS and representing the interests of authors and publishers once
approval has been granted by the court.
4 Robert Darnton, NY Review of Books
5 The settlement doesn’t provide for adding content prior to 1/5/09; however, we are suggesting that, by mutual consent, additional
published content may be added as an expedient method of reaching the library market.
• The manner in which the GBS deals with orphan works will provide a roadmap for
other communities of ‘orphans’ in photography, the arts, and similar content and
intellectual property
Business Analysis:
By mid-2008, the lawsuit was background noise adding to the general malaise and
discomfort characterizing the media industry and the announcement that the parties had
agreed to settle their differences was initially greeted with support, relief and some surprise.
Yet, as the implications of the complex settlement agreement became clearer, a strong
(and, at times, strident) opposition developed to argue for substantial revisions to, or the
elimination of, key sections of the agreement. Importantly, this opposition also succeeded in
prompting the Department of Justice (DoJ) to voice ‘strong opposition’ to segments of the
agreement. Combined with the DoJ’s concerns, the opposition was able to exact significant
changes to the agreement’s terms. A ‘revised
agreement’ was presented to and is now pending approval by Judge Denny Chin of the
Federal District Court of Southern New York.
Among the principal arguments against approval of the original settlement agreement were
the following:
• Google would attain an insurmountable monopoly over in-copyright but out-of-print
works
• The obligation to ‘opt out’ of the agreement places an undue burden on the copyright
holder (author)
• Foreign rights holders were underrepresented (or insufficiently consulted) and thus
disadvantaged by the original agreement
• Monies collected on behalf of copyright holders but never disbursed would be paid
into a ‘general expenses’ fund to benefit the Book Rights Registry6
• Some authors believed their moral rights to determine the use and replication of
their works were circumvented
• The agreement itself would in effect create copyright ‘legislation’, which should be
the purview of Congress
The revision to the agreement has partially addressed these issues (excepting the last item)
but the settlement revision has not fully incorporated all of the challenges supported by the
settlement opposition and the Department of Justice.
Two aspects of the agreement which generated attention and hyperbole concerned the
number of “orphan works” and the revenue model Google would implement to market their
full-text database. Both of these issues are used by settlement opponents to justify the
agreement’s rejection by the Court. In each case, very little real analysis has been
6 Changed in the second version of the settlement so that uncollected funds would eventually be distributed to designated charities.
conducted to determine the true parameters of both the ‘orphan’ issue and the market
opportunity.
In August 2009, we published an estimate of the potential number of orphan works that
may exist. We are unaware of any other detailed analysis that attempts to quantify the
collection of titles which remain in copyright but whose copyright holder has not been
located. This analysis is included as an attachment to this document7. The following chart
summarizes the findings of potential orphan works:
Estimate of Orphan Works   Scenario          Percent of Title Output, 1920–2000
580,388                    Base Case         24%
824,553                    High/Aggressive   34%
In summary, the orphan analysis estimated a potential orphan population of 580,388 based
on a review of pre-existing statistical information documenting the numbers of new titles
published in the US since 1920. While we estimated that ‘orphans’ would be more prevalent
among older titles, total annual title output first exceeded 15,000 only in 1960 (according to
our source data); therefore, the universe of all titles published between 1920 and 1980 is
actually relatively small. Publishing output began to increase rapidly only in the late 1980s,
and it is assumed that the majority of these later titles will not be ‘orphans’
because copyright information is readily available and confirmable. As noted, the full report
is included as an attachment to this report. We believe our analysis to be sound and the
results were supported by a different methodology based on data from OCLC’s WorldCat
database (as noted in the full report).
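As a quick arithmetic check on the figures above (a minimal sketch, using only the counts and percentages from the table), each orphan estimate divided by its stated share of 1920–2000 title output implies roughly the same total title universe of about 2.4 million works:

```python
# Sanity check: each orphan estimate divided by its stated share of
# 1920-2000 title output should imply roughly the same total universe.
base_orphans, base_share = 580_388, 0.24   # Base Case
high_orphans, high_share = 824_553, 0.34   # High/Aggressive

implied_total_base = base_orphans / base_share
implied_total_high = high_orphans / high_share

print(round(implied_total_base))  # 2418283 -- about 2.4 million titles
print(round(implied_total_high))  # 2425156 -- consistent with the base case
```

The two scenarios implying nearly identical totals suggests the percentages were derived from a single underlying estimate of 1920–2000 title output.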
After estimating the total number of ‘orphans’ we also estimated the number of foreign
works that could potentially be included in the GBS. This analysis is more tenuous
statistically because we relied entirely on the OCLC WorldCat database8 and made several
key assumptions and extrapolations. Based on this conditional estimate, we determined
there could be approximately 1.2 million titles from the ten largest publishing languages and
an additional 0.2 million titles from all other languages.
Currently, the content potentially covered by the GBS represents over 12mm titles scanned.
Multiple versions of the same work are included in this total; however, even if all foreign
works are to be excluded from the database and authors and publishers voluntarily remove
7 A related analysis that extrapolates the potential number of foreign-language titles that may fall under the umbrella of the settlement
has also been completed but is not included in this document.
8 This is not to assert that the WorldCat data is inaccurate in any way; rather, our assumptions should be considered ‘best-guess’.
their titles from inclusion, the Google Book subscription product will remain a compelling
database for the academic and public library market as well as schools and certain
corporations. A significant change adopted in the amended settlement agreement has
narrowed the class to UK, Australian and Canadian published books in addition to those
registered with the US copyright office.9
The Google Books Database Subscription and Revenue Model
Opponents have suggested that Google will be in a position to exercise monopolistic pricing
and to ‘overcharge’ to extract maximum revenues from their customers. We agree that their
market position could be abused; however, we believe there is a counter-balance included
in the agreement that checks this tendency. Google seeks maximum exposure for the
content - not only to support its stated mission of providing wide and broad access to this
‘hidden’ content, but also to support other business opportunities they may implement
(such as advertising programs). We believe Google will see overly aggressive pricing as an
inhibitor to wide market acceptance of the product. The Book Rights Registry will represent
the interests of authors and publishers who will argue for pricing that maximizes their
opportunity. Together, balancing wide access (Google’s position) with pricing considerations
will result in an optimal pricing matrix.
In developing our financial and market analysis, there are several key assumptions we have
relied upon10:
• Pricing will be variable based on type of institution
• This will be considered a ‘must have’ database product for all libraries
• The Google product will effectively “level the playing field” from small to large
academic libraries for the types of books covered by the Settlement
• Google will continue to invest in the Book database product by adding content,
functionality and applications/tools to aid usage over time and may raise pricing
• Penetration will not reach 100% for any segment, but is likely to grow over time
• Corporations will be important customers (e.g., science, aeronautics and
engineering-based firms)
9 As an upper limit, the number of non-English-language titles could be 50% of the total books scanned.
10 Business models that include advertising are not assumed in this analysis. It may be possible that Google will use the scanned
content as material around which to tailor advertising offers; however, the second amended version has narrowed the application of
varied business models and it is difficult to see any model other than a subscription-based service being the primary revenue generator
for Google and the BRR. Over time this may change, but that circumstance is not anticipated in this analysis.
In the following analysis, we attempt to define the Google Books Database market
opportunity and estimate the potential annual revenues the company may be able to
generate each year from database subscriptions. Google currently markets several services
to publishers, including Google Scholar, the Google Partner Program and Google Editions
(to be launched in mid-2010). These current products and services are not included
or assumed in this analysis.
In estimating the market potential for the Google Settlement database product, we have
taken three primary components (or drivers) into account: Market segmentation,
penetration and pricing.
Market Segment
The agreement provides Google with the right to exploit certain markets including
academic, public and special libraries, corporate customers, print-on-demand (POD)11 and
direct-to-consumer sales. In our analysis, we have used American Library Association data
itemizing the type and number of libraries in the US and used “best guess” estimates of the
market opportunity represented by corporations and consumers. Most commentary to date
has focused on the library community, which is where this analysis is strongest in its
estimates and where we concentrate our discussion.
An important accommodation of the Settlement is the provision of free access to the
database product for all public libraries and certain “Carnegie” classed libraries. Each library
accepting this access will receive the equivalent of a single user sign-on that will allow
patrons and/or staff to access the Settlement database without restriction. While an
important accommodation for some libraries, for the majority this access will not be
sufficiently functional and, thus, the site-wide, unlimited-user access provided under the
terms of the subscription product will remain the better option. We do not believe this
free access will materially impact the revenue opportunity for Google and have allowed for
this circumstance in our financial model.
In our opinion, academic libraries will consider a subscription to the Google Books database
as a competitive necessity. For the first time, any subscribing library within the United
States may gain direct access to the collections of some of the largest and most renowned
academic collections in North America12. In addition, this access will far surpass the inter-
library loan process of years past simply because the content is completely indexed.
Researchers will no longer have to ‘guess’ that a title may be relevant to their research
based on an index or table of contents and, moreover, they eliminate the risk of requesting
that a title be delivered only to discover the content is irrelevant.
11 POD is a right that may be granted to Google in the future pending approval of the Book Rights Registry and the rightsholders they
will represent.
12 The amended settlement has narrowed the class and effectively excludes non-English titles from the database.
Many academic library collections have been built over centuries and titles in their
collections are often unique, which is another compelling reason supporting the argument
that the Google database represents a singular opportunity for all academic institutions to
“narrow the gap” between their research capabilities and those of the country’s largest and
best endowed institutions. While some academic collections’ titles are available via inter-
library loan, many older, fragile and unique works are only available at the institution itself
by special request. The digitization of many (not all) of these works significantly broadens
access to and distribution of this content. Undoubtedly, researchers, educators and students
at all academic institutions will pressure their administrators and librarians to subscribe to
the product13.
The following chart represents our construct for the potential addressable market segments
for the Google book database14:
Total Number of Academic Libraries 3,617
Total Public Libraries 9,198
School Libraries 99,783
Special Libraries 9,066
Armed Forces 296
Government 1,159
Market Penetration:
We estimate that sales penetration will vary considerably across the segments; however, for
the reasons presented earlier, we believe penetration into the academic library segment will
lead all other markets. Public libraries (particularly metropolitan library systems) will find
value in the database and, as a group, will represent the largest concentration of customers
overall. School libraries are unlikely to subscribe to the database in great numbers for
budgetary or relevance reasons and, moreover, students will be encouraged to gain access
to the product via their public library remote-access facilities.
We expect larger research public libraries (such as The New York Public Library) will be
treated as academic libraries for the sake of pricing. We also expect some corporations to
access the database product and, while pricing for these ‘for profit’ entities should be
comparatively high, the absolute number of customers in this segment will be small.
Pricing:
Database subscription pricing can be complicated and confusing. Models can be based on
population served, purchasing budgets and/or enrollment, and then be subject to
13 It is likely that an extensive database of user behavior may be generated by usage of this product. This is data that publishers (and
authors) may be interested in mining for product development and/or insights into consumer behavior.
14 Source: American Library Association
multiplication factors such as number of simultaneous users, number of physical locations
and other factors. We don’t know which method Google will choose; however, in order to
keep our analysis as simple and transparent as possible, we have built our pricing model on
the basis of the following criteria:
• Unlimited users per location
• Branch public libraries priced at 25% of base fee per additional branch
• 3% price increases per year
• Institution ‘classification’ based on ALA data
• Full ramp-up will occur over the first three years
Additionally, we expect Google will sell to the ‘highest’ administrative level possible15. For
example, the University System of Georgia manages licensing contracts under their Galileo
program for both public and academic libraries and, therefore, this agency would be the
customer rather than individual or local libraries. In New York, Google would license access
to the library authorities in each borough. In New York City (Manhattan), this would mean
the main library and roughly 50 satellite libraries would have unlimited access via one
contract and, based on our pricing matrix, the NYPL would pay approximately $340,000 per
year for access ($25,000 for the main and $6,250 per 50 locations)
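The branch-based pricing rule above can be sketched in a few lines; the $25,000 base fee and 50-branch count are the report's illustrative NYPL figures, not actual Google pricing:

```python
# Sketch of the branch pricing rule described above: each additional
# branch is priced at 25% of the institution's base fee. Figures are
# the report's illustrative NYPL numbers, not actual Google pricing.

def site_license_price(base_fee: float, branches: int, branch_rate: float = 0.25) -> float:
    """Annual price: base fee for the main location plus a per-branch
    fee at `branch_rate` of the base fee for each additional branch."""
    return base_fee + branches * branch_rate * base_fee

# NYPL example from the text: $25,000 base fee, ~50 satellite locations.
nypl = site_license_price(25_000, 50)
print(f"${nypl:,.0f}")  # $337,500 -- roughly the $340,000 cited
```
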
For-profit organizations (corporations and businesses) will have a pricing matrix higher than
for non-profit libraries and institutions (generally standard practice). We would expect that
only a relatively small percentage of businesses would subscribe to the entire database and
we have segmented the target market into Fortune 500, 1,000 and all others. The corporate
customers most likely to subscribe would be those companies with large research needs
such as pharmaceutical, aeronautics, engineering and the like. Options to better address
this market may include shorter subscription terms, usage based on metering systems or
topic/subject specific packages.
Market Opportunity Summary:
We believe Google and the Book Rights Registry (a proxy for authors, authors’ heirs and
publishers) will be motivated to maximize access to the Google database in order to
maximize viewing of the content which will, in turn, result in optimal revenues for both. We
do not believe Google will implement a monopolistic approach to pricing and, in comparison
with smaller and more segmented databases, we believe the Google pricing will appear
reasonable considering the breadth and depth of content in the database.
Approach to the Market:
15 Consortia pricing, while an important consideration, would represent a discount to the pricing matrix we present and would be
negotiated on a case-by-case basis. We have not made accommodations for consortia pricing.
In our view, Google has several options for marketing and selling this database product:
• Google sells the product themselves with their own sales force
• Google designates one supplier for each segment
• Google allows all vendors to integrate the books database product into their existing
database products and pays Google a defined fee per user.
In our view, it is unlikely that Google will establish their own sales force to sell into the
library and corporate marketplaces. While Google does have an ad sales force supporting its
SEM program(s), this activity is vastly different from building a sales team to call on
libraries and corporate clients. Additionally, given Google’s predilection for automation, the
hiring of a human sales team doesn’t seem culturally acceptable. Lastly, and possibly more
important, we believe licensing this product will become more a ‘renewal’ business as the
market matures (after 3-4 years), which could require far less sales effort – or effort
significantly different from that required in the first three years. We estimate a fully staffed Google sales
force could cost the company $15million annually but, in short, Google is unlikely to want
the headache.
Given the limitations of the above approach, we believe it is more likely Google will contract
with one or more of the established players and pay a standard sales commission to the
provider. In this model, Google will be able to set prices and targets and retain a degree of
control over both the provider of this sales effort and the market delivery (pricing) of the
product. Existing providers would bid on the right to sell this database on behalf of Google
and, because the product will be highly valued, the bidding would likely be highly
competitive. Likely providers to Google would include ProQuest, Gale/Cengage, OCLC or
EBSCO. It is also possible that an ‘outlier’ such as Ingram, Baker & Taylor or Hudson News
(LibreDigital) would also see representing this database as a significant opportunity. For an
established player, it is likely the provider would see increased sales in their current offering
– simply representing the Google Books database would open new market opportunities. For
an ‘outlier’, the Google Books product may represent an opportunity to enter the market
using the Google product as a foundation.
In our estimation, the above scenario is not only practical (not having to administer their
own sales force is a major advantage), but may also be cost effective. Given the ‘prize’ of
representing the Google database, we believe the average cost to Google may be less than
10% of revenues. (“Renewal” sales may also be commissioned at a lower rate than initial sales).
Working with a single provider thus represents an effective solution for Google but this
strategy may not also be efficient. In order to achieve greater efficiency in reaching their
target market while also eliminating possible “political” issues caused by selecting one
vendor over the others, the company may consider allowing any provider to sign a standard
distribution agreement with the company and sell and market the product into all markets.
This approach has several advantages:
• Immediately leverages the competitive position of all major providers, whose customer
bases might otherwise be mutually exclusive
• Gives a library subscriber a choice of provider and/or allows them to work with an
existing ‘preferred’ vendor
• Potentially enables providers to integrate the Google product with their existing
products, enabling rapid product development and built-in content ‘handcuffs’ that
support renewals
• Minimizes Google’s exposure to any supplier limitations and negative customer
support issues
• Provides maximum exposure to all market segments virtually immediately
• As part of these agreements, Google may gain access to index all content supplied
by their third-party sales partners
Approach to the Market Summary:
Based on this review of Google’s tactical options, we believe the company will enable
multiple (initially ‘preferred’) vendors to market and sell the product into the market.
Google will establish pricing, and the vendors will be required to pay Google based on this
set price schedule (less vendor commission). Under this model, any vendor will be free to
charge the end customer less than the ‘set’ price; however, the vendor would still pay
Google based on the higher ‘full’ price. (Selling below the set price could occur, for
example, when the vendor bundles the product with its other offerings.)
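The set-price mechanics above have an important consequence that a short sketch can illustrate. All figures here are assumptions for illustration (a $55,000 ‘set’ price and a 10% commission), not specific numbers from our model: because Google is paid on the full set price, any discount a vendor offers comes out of the vendor’s own commission.

```python
# Illustrative sketch of the proposed set-price mechanics. All figures are
# assumptions for illustration only.

def google_receipt(set_price: float, commission_rate: float) -> float:
    """Google is paid on the 'set' price less the vendor commission,
    regardless of what the vendor actually charges the library."""
    return set_price * (1 - commission_rate)

set_price = 55_000       # assumed full 'set' price for one subscription
commission = 0.10        # report estimates Google's average cost below 10%
street_price = 49_500    # vendor discounts below the set price (e.g. a bundle)

owed_to_google = google_receipt(set_price, commission)
vendor_margin = street_price - owed_to_google   # discount comes out of the commission

print(f"Owed to Google: ${owed_to_google:,.0f}")               # $49,500
print(f"Vendor margin after discount: ${vendor_margin:,.0f}")  # $0
```

In this illustration, a 10% discount to the end customer consumes the vendor’s entire commission, which is why vendors are likely to discount only when bundling creates offsetting value.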
Forecasted Revenue Expectations:
Based on our assumptions documented above, we believe the revenue Google may
generate from the Google Books database product could approach $260 million per year. Our
revenue model was based on the following set of assumptions:
• Base pricing by segment
• Price discounts based on size of library holdings or population served
• Penetration levels based on library size
• Revenue represents full implementation, which we expect by year three
The following chart documents our estimates:
Segment         Total Market   Penetration   Avg. Pricing   Avg. Revenue ($MM)
Academics              3,617           65%        $55,000               $130.1
Publics                9,198           47%        $21,000               $112.8
School                99,783          0.5%        $10,000                 $4.9
Special                9,066          0.5%        $25,000                 $1.1
Armed Forces             296            5%        $11,000                 $0.1
Government             1,159           25%        $11,000                 $3.1
Corporate            100,000            2%        $37,500                 $7.5
Total                                                                   $260.0
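The roll-up can be checked with a quick sum. The per-segment revenue figures below are taken directly from the table; note that segment revenue reflects the size-based discounts in our assumptions, so it does not always equal market × penetration × average price exactly.

```python
# Quick check of the revenue roll-up. Per-segment revenue figures ($MM) are
# taken directly from the table above.

segments_mm = {
    "Academics":    130.1,
    "Publics":      112.8,
    "School":         4.9,
    "Special":        1.1,
    "Armed Forces":   0.1,
    "Government":     3.1,
    "Corporate":      7.5,
}

total_mm = sum(segments_mm.values())
# Nominal value per title, using the 12mm-title figure cited in the report
per_title_value = total_mm * 1_000_000 / 12_000_000

print(f"Total: ${total_mm:.1f}MM")           # $259.6MM, rounded to $260MM
print(f"Per title: ${per_title_value:.1f}")  # ~$21.6, rounded to ~$22 per year
```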
As noted, we believe it will take Google three years to ramp up to this full implementation
revenue (we do not see this as a limitation on Google’s part but rather as a typical
expectation for a new-product roll out). At the above levels, we believe pricing is not only reasonable
and affordable, but compares favorably with existing database publishers’ pricing. There are
few, if any, other publishers who have products which serve as many (all) segments as the
Google Book database.
At this revenue level, each of the 12mm titles in the Google database has a nominal value
of $22 (per year) to Google. More importantly, the per-unit price paid by each library will be
less than $0.05 (five cents). On a pure cost-avoidance basis, licensing the Google Books
database appears to be good value given current costs. If the costs of handling, cataloging, special
requests (such as interlibrary loans) and storage are added to the base wholesale price of
any title, the title’s full ‘carrying costs’ can double. Some studies have indicated that
fulfilling an interlibrary loan request can cost $25 for each segment of the transaction, from
the library to requestor and back. This cost far exceeds the original (or, in many instances, the
replacement) cost of the title16.
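The per-title cost claim above can be sanity-checked from the table’s figures. The subscriber counts below are implied by applying the stated penetration rates to the stated market sizes; this is an approximation, since our model applies size-based discounts within segments.

```python
# Sanity check of the per-title cost claim. Subscriber counts are implied by
# applying the table's penetration rates to its market sizes (an
# approximation of the report's model).

implied_subscribers = (
    round(3_617 * 0.65)      # Academics
    + round(9_198 * 0.47)    # Publics
    + round(99_783 * 0.005)  # School
    + round(9_066 * 0.005)   # Special
    + round(296 * 0.05)      # Armed Forces
    + round(1_159 * 0.25)    # Government
    + round(100_000 * 0.02)  # Corporate
)

avg_subscription = 260_000_000 / implied_subscribers  # average annual fee
per_title_cost = avg_subscription / 12_000_000        # across 12mm titles

print(f"Implied subscribers: {implied_subscribers:,}")       # 9,523
print(f"Per-title cost per library: ${per_title_cost:.4f}")  # well under $0.05
```

At roughly a quarter of a cent per title per library per year, the subscription compares very favorably with the carrying costs of physical holdings discussed above.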
While we believe this database to be an important acquisition for most academic and many
public libraries, we do expect that Google will need to sell this product aggressively in the
early years to achieve the penetration levels we anticipate. There are several reasons for
this: Firstly, the content of the database is largely unknown and, while it is representative of
many important library collections, Google will need to market the collection as important
and complementary to the library customers in question. Secondly, the sheer size of the
database could be an inhibiting (or intimidating) factor, and therefore the navigation,
16
Users may print all or portions of the titles they select – although the ability (functionality) to do this may be a subsequent grant
provided by the BRR to Google – and there is a cost to these activities; however, we maintain that the utility of the database and the
ability of the user to be precise in their printing requests will produce only a marginal negative cost (if any) relative to the cost
avoidance that is endemic to the current solution.
bibliographic data quality and the delivery of subject ‘collections’ will be important customer
acquisition and retention areas for the company to focus on.
In summary, we believe Google will be able to successfully launch their Book Database
product into the market with fair and reasonable pricing that will encourage a broad base of
target customers to subscribe.
Future Market Growth Opportunities:
While the launch of this product is a focus of attention, we believe the company has
numerous opportunities to expand the product over time. We do not expect the Google
Books database product to ‘stand still’; rather, we believe this product could become the
primary access point for textual (monograph) materials in the library market.
Future market opportunities17:
• The addition of other content: Publishers may see this product as a viable library
market entrance point for all their book content
• Provision of usage data to publishers (and others) for business and product
development needs
• Price increases over time as penetration grows
• Inclusion of international/non-US market content – English language
• Inclusion of international/non-US market content – Non-English language
• Access to international markets
• Addition of more in-copyright materials closer to current pub dates; perhaps
becomes a major distribution mechanism for book content
• Topic/segmented collections
• Potential to open the database for third party application development
17
We expect these opportunities to ‘evolve’ over time based on discussion, negotiation and mutual agreement of the parties.
Summary:
This analysis argues that the Google Books Database product will be seen as a ‘must have’
product for a large proportion of academic and public libraries and is, thus, valuable on its
merits. Google will price this product at levels both lower than existing database providers
and at levels that are ‘economically viable’ given cost avoidance justifications. The company
retains flexibility in how they will approach selling and marketing the product; however, we
believe they will contract these services. Lastly, we believe there is potential upside to the
revenue model based on adding new markets and expanding content.
Addendum A – Orphan Works Analysis
580,388 Orphans (Give or Take)
Clearly one of the most (if not the most) contentious issues regarding the Google Book
Settlement (GBS) centers on the nebulous community of “orphans and orphan titles”. And
yet, through the entirety of the discussion since the Google Book Settlement agreement was
announced, no one has attempted to define how many orphans there really are. Allow me:
580,388. How do I know? Well, I admit, I do my share of guess work to get to this
estimate, but I believe my analysis is based on key facts from which I have extrapolated a
conclusion. Interestingly, I completed this analysis starting from two very different points
and the first results were separated from the second by only 3,000 works (before I made
some minor adjustments).
Before I delve into my analysis, it might be useful to make some observations about the
current discussion on the number of orphans. First, when commentators discuss this issue,
they refer to the ‘millions’ of orphan titles. This is both deliberate obfuscation and lazy
reporting: Most notably, the real issue is not titles but the number of works. My analysis
attempts to identify the number of ‘works’; titles are a multiple of works. A work will often
have multiple manifestations or derivations (paperback, library version, large print, etc.)
and, thus, while the statement that there may be ‘millions of orphans titles’ may be partially
correct, it is entirely misleading when the true measure applicable to the GBS discussion is
how many orphan works exist. It is the owner (or parent) of the work we want to find.
To many reporters and commentators, suggesting there are millions of orphans makes
sense because of the sheer number of books scanned by Google but, again, this is laziness.
Because Google has scanned 7–10 million titles, so the logic goes, there must be
‘millions of orphans’. However, as a 2005 report (which I understand they are updating) by
OCLC noted, many definitional disclaimers are applied to this universe of titles such as titles
in foreign languages, titles distributed in the US, titles published in the UK, to name a few.
Accounting for these disclaimers significantly reduces the population of titles at the core of
this orphan discussion. These points were made in the 2005 OCLC report (although they
were not looking specifically at orphans) when they looked at the overlap in title holdings
among the first five Google libraries. (And, if you like this stuff, this was pretty interesting).
Prognosticators unfamiliar with the industry may also believe there are millions and millions
of published titles since, well, there are just lots and lots in their local B&N and town library.
The two methods I chose to try to estimate the population of orphans relied, firstly, on data
from Bowker’s BooksinPrint and OCLC’s Worldcat databases and, secondly, on industry data
published by Bowker since 1880 on title output. I accessed BooksinPrint via NYPL (Bowker
cut off my sub) and Worldcat is free via the web. The Bowker title data has been published
and referred to numerous times over the years and I found this data via Google Book
Search; I also purchased an old copy of The Bowker Annual from Alibris.
In using these databases, my goal was to determine whether there are consistencies across
the two databases that I could then apply to the Google title counts. In addition to the ‘raw
data’ I extracted from the databases, OCLC (Dempsey) also noted some specific numbers of
‘books’ in their database (91mm), titles from the US (13mm) and non-corporate ‘Authors’
(4mm). Against the title counts from both sets of data, I attributed percentages which I
then applied to the Google universe of titles (7mm). (My analysis also 'limits' these numbers
to print books excluding, for example, dissertations).
In order to complete the analysis to determine a specific orphan population, I reduced my
raw results based on “best guess” estimates for non-books in the count, public domain titles
and titles where the copyright status is known. These final calculations result in a potential
orphan population of 600,000 works. I also stress-tested this calculation by manipulating
my percentages resulting in a possible universe of 1.6mm orphan works. This latter
estimate is (in my view) illogical, as I will show in my second analysis.
An important point should be made here: I am calculating the potential orphan population,
not the number of orphans. These numbers represent a total before any effort is made to
find the copyright holder. These efforts are already underway and will get easier once
money collected by the Book Rights Registry begins to be distributed.
My second approach emanated from a desire to validate the first approach. If I could
determine how many works had been published each year since 1924, then I could attribute
percentages to this annual output based on my estimate of how likely it was that the
copyright status would be in doubt. Simply put, my supposition was that the older the work,
the more likely it was that it could be an orphan.
Bowker has consistently calculated the number of works published in the US since 1880
(give or take) and the methodology for these calculations remained consistent through the
mid-1990s. According to their numbers, approximately 2mm works were published between
1920 and 2000. Unsurprisingly, a look at the distribution of these numbers confirms that
the bulk of those works were published recently. If there were (only) 2mm works published
since the 1920s, it is impossible to conclude there are millions of orphan works.
To complete this analysis, I aggressively estimated the percentage of works published each
decade since 1920 which could be orphan works. The analysis suggests a total of 580K
potential orphan works which, as a subset of the approximately 2mm works published in the
US during this period, seems a reasonable estimate. This meets my objective of ‘validating’
my first approach (using OCLC and BIP data): both approaches, using different
methodologies, reach similar conclusions.
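The decade-based approach can be sketched as follows. Since the decade-level inputs are not published here, both the title-output figures and the orphan percentages below are hypothetical placeholders, chosen only to illustrate the shape of the calculation (declining orphan rates, rising output) while summing to roughly 2mm works and ~580K potential orphans.

```python
# Hypothetical reconstruction of the second (decade-based) approach.
# Works-published figures (in thousands) and potential-orphan rates are
# illustrative placeholders, NOT the actual inputs used in the analysis.

# (decade, works published in thousands, assumed potential-orphan rate)
decades = [
    ("1920s",  80, 0.55),
    ("1930s",  90, 0.50),
    ("1940s",  95, 0.45),
    ("1950s", 130, 0.40),
    ("1960s", 230, 0.35),
    ("1970s", 330, 0.30),
    ("1980s", 450, 0.25),
    ("1990s", 600, 0.17),
]

total_works = sum(w for _, w, _ in decades)   # ~2,005K works, i.e. ~2mm
orphans = sum(w * r for _, w, r in decades)   # ~578K potential orphans

print(f"Works published: {total_works:,}K")
print(f"Potential orphans: {orphans:,.0f}K")
```

Note that even with lower assumed rates, the later decades contribute more potential orphans in absolute terms (e.g. the 1980s row exceeds the 1920s row), consistent with the point about accelerating publishing output.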
There are several conclusions that can be drawn from this analysis. Firstly, since the
universe of works is finite, beyond a certain point the Google scanning operation will
begin to find ‘new’ orphans at a decreasing rate. I don’t know if this number is 5mm
scanned titles or 12mm; my estimate is 7mm because, according to Worldcat, there are
3mm authors to 12mm titles. If you apply this ratio to the Bowker estimate of total works
published, the number is around 7–8mm titles. Secondly, publishing output accelerated in
the latter part of the 20th century. While my percentage estimates for the number of more
recent orphans were comparatively lower than the percentages applied to ‘older orphans’
from the early part of the century, the base number of published titles is much higher, and
therefore the number of possible orphans is higher. Common sense dictates that it will be
far easier to find the parents of these later ‘orphans’.
In the aggregate, the 600K potential orphans may still seem high against a “work”
population of 2.2mm (roughly 25%). I disagree, given the distribution of the ‘orphan’ works
described in the paragraph above, and because my estimate assumes no effort by the BRR to find
and identify the parents. In my view, true orphans will be a much lower number than 600,000, which
leads me to my final point. Money collected on behalf of unidentified orphan owners will
eventually be disbursed to cover costs of BRR or to other publishers. There has been some
controversy on this point and it derives, again, from the idea that there are millions of
orphans and thus the pool of undisbursed revenues will be huge. The true numbers don’t
support this conclusion. There will not be a huge pool of royalty revenues to be ultimately
disbursed to publishers who don’t ‘deserve’ this windfall because there won’t be very many
true orphans. The other point here is that royalty revenues will be calculated on usage and,
almost by definition, true orphan titles for the most part are not going to be popular titles
and therefore will not generate significant revenues in comparison with all other titles.
This analysis is not definitive, it is directional. Until someone else can present an argument
that examines the true numbers and works in more detail, I think this analysis is more
useful to the Google Settlement discussion than referring by rote to the ‘millions of
orphans’. The prevailing approach is lazy, misleading and inaccurate.
Copyright: Michael Cairns – Replication and Distribution By Permission 19