SlideShare uma empresa Scribd logo
1 de 22
Baixar para ler offline
Page 1 of 22

      A perspective of Publicly Accessible/Open Access Chemistry Databases



                                                                   Antony J. Williams

                            ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587

                                            Phone: 919-341-8375; FAX: 919-300-5321

                               Corresponding Author:antony.williams@chemspider.com



           Keywords: Open Access Chemistry, Online databases, Internet chemistry



Teaser Sentence: Online chemistry resources are facilitating access to information

at an increasingly rapid pace offering better and faster decision making and speed up

the process of discovery.



Abstract

       The internet has spawned access to unprecedented levels of information. For

chemists, the increasing number of resources they can utilize to access chemistry-

related information provides them a valuable path to discovery of information, which

was previously limited to commercial and, therefore, constrained resources. The

diversity of information continues to expand dramatically and, coupled with an

increasing awareness for quality, curation and improved tools for focused searches,

chemists can now find valuable information within a few seconds using a few

keystrokes. Shifting to publicly-available resources offers great promise to the

benefits of science and society yet brings with it increasing concern from commercial

entities. This article discusses the benefits and disruptions associated with an

increase in publicly available scientific resources.
Page 2 of 22



       Ask a chemist how often they use the internet to search for science-related

information online and it would be fair to expect the answer would be daily. Certainly

the web now dominates as the primary portal to query for general information and

data. Yet, despite the tremendous growth in scientific internet resources, the

potential for providing useful chemistry information and data have only recently

started to be tapped. Bioinformatics has led the charge in providing online access to

data far ahead of the efforts in Chemistry. Open-access databases like GenBank and

the Protein Data Bank have been assisting biologists to translate gene and protein

sequences into biological relevance for well over two decades. Some of the

responsibility for the differences in efforts has commonly been put onto the

shoulders of publishers in Chemistry, whether for scientific articles or commercial

chemistry databases. Nevertheless, societal thrust, evangelists and group efforts are

forcing both free and open access (vide infra) to chemistry-related information.

       There are many indexes of chemistry databases online and this article is not

intended to be yet another. Rather, this article will review both the availability and

capabilities of online chemistry databases, particularly those offering free or open

access. The progress in the availability of freely-accessible information is highly

enabling and advantageous for the advancement of science and our future well-

being, but likely seen as a disruptive force for commercial bodies. This shift is

needed, not only to facilitate improved access to information in academia and

government laboratories, but also in commercial organizations feeling the pressure

of   poor   performance   in   the   discovery   and   development   processes.    The

pharmaceutical companies in particular should welcome improved access to

chemistry-related information as their business dominance is taken to task with

drugs coming off patent and no replacement blockbuster drugs in the pipeline.
Page 3 of 22

       In keeping with the web-based nature of this article, the majority of

references will be to internet resources. At the time of submission all references were

active but, as is the nature of the internet, these resources will age and may

disappear.



What is Open versus Free Access?

       There is much confusion around the differences between Open Access (OA)

versus Free Access (FA). This is also accompanied by corporate protectionism and

political battles which have found their way to Capitol Hill. Both OA and FA offer a

great opportunity to the advancement of science by sharing data, information and

knowledge as it is created. With these in place, publishing houses and institutional

repositories of information, specifically structure databases with related information,

are threatened by the potential impact on their business model and associated

revenues.

       The first major international statement on open access was the Budapest

Open Access Initiative (BOAI), in February 2002 (FAQs online). The definition of

Open Access from the BOAI frequently asked questions website is as follows: “By

'open access' to this literature, we mean its free availability on the public internet,

permitting any users to read, download, copy, distribute, print, search, or link to the

full texts of these articles, crawl them for indexing, pass them as data to software, or

use them for any other lawful purpose, without financial, legal, or technical barriers

other than those inseparable from gaining access to the internet itself. The only

constraint on reproduction and distribution, and the only role for copyright in this

domain, should be to give authors control over the integrity of their work and the

right to be properly acknowledged and cited.” With this definition in mind, one can

understand the potential impact to publishing business models. There are other

concerns in the OA model but these are reviewed elsewhere.
Page 4 of 22

       Free Access is not equivalent to OA, but, by definition, OA assumes Free

Access as part of its model. This author could not find an equivalent definition for

Free Access online but suggests the following: Free access is access that removes

price barriers but not necessarily any permission barriers. Many organizations

provide free access to their publications, content and data but the rules under which

the information is made available can differ between the hosts. Two specific

examples are cited:

       1) The Royal Society of Chemistry offers access to thousands of articles via

           its free access policies. Copyright remains with the society.

       2) The SureChem patent portal (vide infra) provides chemical structure-

           searchable access to patents online, free of charge. They do, however, sell

           their content to organizations, provide secure portal access and data

           export for organizations for a fee and have a business model enabling

           both free access and for profit activities.

       To most people, free access is not inferior to open access – just different.

Most scientists are overjoyed to have free access to information and data previously

not available and will willingly use such resources.

       Political decision makers are now directly engaged in reviewing the benefits of

OA (and even FA) to science and society and limiting Open Access. In November of

2007, President Bush vetoed a bill that aimed to make all National Institutes of

Health (NIH)-funded research publications freely available on the web and the furore

that arose from the appearance of the PubChem database as competition to other

repositories (specifically the Chemical Abstracts Service) has caused a significant rift

between members of the ACS and the society’s leadership. While these challenges

are yet to be fully navigated, it is clear that publishers will need to modify their

business models to address the drive towards more FA and OA resources.
Page 5 of 22

Meanwhile, groups such as the Public Library of Science and Chemistry Central,

discussed recently, are leading the charge for more Open Access.

       For the purpose of this article, both Free Access and Open Access will be

defined as: no barriers to using the system in order to derive value; no forced

registration or logins in order to use the system. This does not necessarily mean that

there might not also be fee-based services associated with the resource - a site can

have free and fee-based services on offer simultaneously.



Free and Open Access Online Chemistry Databases

       Depending on the definition of a database, there are many freely available on

the web. It is not uncommon to hear chemists comment about a downloadable

database of structures. This commonly refers to a file containing a number of

structures, commonly tens to tens of thousands in the Structure Data Format (SDF).

These files are generally imported to a database for viewing. There are many

hundreds of SDF files available online and, based on experience, a week of

downloading, importing and de-duplication could easily provide at least 15 million

unique structures. As this article goes to press the PubChem dataset now numbers

18 million unique structures and can be downloaded as a single data source. These

SDF files are commonly provided by chemical vendors for the purpose of facilitating

commercial sales. Such files can contain structure identifiers (names and numbers),

experimental or physical properties, file specific identifiers and, commonly, pricing

information. Since they are assembled in a heterogeneous manner such data are

plagued with inconsistencies and data quality issues.

       There are, however, a number of online database resources offering access to

valuable data and knowledge. Some of these could be thought of as “linkbases”, a

term which, for the sake of this article, can be considered as a repository of

molecular connection tables linking out to multiple sources of data and associated
Page 6 of 22

information. While this review cannot be exhaustive, we will examine a number of

these free resources and the value they are starting to deliver to chemists.



PubChem

       The highest profile online database is certainly PubChem. NIH launched the

database in 2004 as part of a suite of databases supporting the New Pathways to

Discovery component of their roadmap initiative. PubChem archives and organizes

information   about   the    biological   activities   of   chemical   compounds   into   a

comprehensive biomedical database to support the Molecular Libraries initiative

component of the roadmap. PubChem is the informatics backbone for the initiative

and is intended to empower the scientific community to use low molecular weight

chemical compounds in their research.

       PubChem consists of three databases (PubChem Compound, PubChem

Substance, and PubChem Bio-Assay) connected together and incorporated into the

Entrez information system of the National Center for Biotechnology Information

(NCBI). PubChem Compound contains 18 million unique structures and provides

biological property information for each compound through links to other Entrez

databases (see Figure 1 for an example). PubChem Substance contains records of

substances from depositors into the system. These are publishers, chemical vendors,

commercial databases and other sources. It provides descriptions of chemicals and

links to PubMed, protein 3D structures, and screening results. The PubChem

Compound database contains records of individual compounds. PubChem BioAssay

contains information about bioassays using specific terms pertinent to the bioassay.

       PubChem can be searched by alphanumeric text variables, such as names of

chemicals, property ranges or by structure, substructure or structural similarity. As

of December 2007 its content is approaching 38.7 million substances and 18.4

million unique structures.
Page 7 of 22

       The value of PubChem has been lauded by a number of advocates, especially

those evangelizing open access and open data initiatives. Certainly such a source of

data opens up new possibilities in regards to data mining and extraction. Zhou et al

[1] concluded that the system has an important role in as a central repository for

chemical vendors and content providers. This enables evaluation of commercial

compound libraries and saves biomedical researchers from the work associated with

gathering and searching commercial databases. They concluded that the hit-to-lead

decision making process in drug discovery programs can certainly benefit from the

ongoing annotation service provided by PubChem. They identified that over 35% of

the 5 million structures from chemical vendors or screening centers found in the

PubChem database currently are not present in the CAS database and suggested

that parties such as CAS could benefit from inclusion of the open-access chemical

and biological data found in PubChem into their own repositories.

       PubChem continues to grow in stature, content and capability. The number of

both substances and unique chemical structures continues to grow on an annual

basis. The bioassay data resulting from the NIH Roadmap initiative is likely to

continue to grow and PubChem will assume a prominent role in distributing the data

in a standard format. The integration to PubMed offers a necessary connection

between the scientific literature and associated chemical structures. PubMed includes

over 17 million citations from MEDLINE and other life science journals for biomedical

articles back to the 1950s. PubChem and PubMed are intimately linked as discussed

by Zhou et al. [1]. At the time of their study there were over 82,000 PubMed papers

and 1.6 million PubMed Central papers linked to PubChem.

       Despite the obvious value of PubChem, the platform has not been without its

detractors. There have been numerous heated debates regarding the position of CAS

relative to the resource and the reader is referred elsewhere for opinions and

commentaries. Others have commented on their concerns regarding the quality of
Page 8 of 22

the data content within PubChem. Shoichet [2], has commented that the screening

data are less rigorous than those in peer-reviewed articles, and contain many false

positives. Deposited data are not curated, and so mistakes in structures, units and

other characteristics can, and do, occur. Shoichet worries that chemists who use

PubChem may be sent on a wild goose chase. Williams and others have pointed to

the accuracy of some of the identifiers associated with the PubChem compounds. The

problems arise from the quality of submissions from the various data sources. There

are thousands of errors in the structure-identifier associations due to this

contamination and this can lead to the retrieval of incorrect chemical structures. It is

also common to have multiple representations of a single structure, due to

incomplete or total lack of stereochemistry for a molecule. PubChem is not resourced

to resolve these issues and these unfortunate errors have generally arisen due to

chemical vendors seeing PubChem as a potential path to commercial gain, rather

than its primary focus of supporting the Molecular Libraries initiative.



eMolecules

       eMolecules, originally released as Chmoogle prior to legal wranglings with a

famous search engine provider is a commercial entity that offers a free online

database of over 7 million unique chemical structures derived from more than 150

suppliers and offers scientists and procurement staff a path to identifying a vendor

for a particular chemical compound. By providing access to compounds for purchase

they are providing a free access online service similar to those of commercial

databases such as Symyx Available Chemical Directory, CAS’s ChemCats and

Cambridgesoft’s ChemACX, to name just three. They also offer a service to chemical

suppliers to customize a search page for the vendor and request online quotes.

       The system offers access to more than 4 million commercially available

screening   compounds     and   hundreds    of   thousands   of   building   blocks   and
Page 9 of 22

intermediates and access via ChemGate to NMR, MS and IR spectra from Wiley-VCH

for over 500,000 compounds, a fee-based service. eMolecules also provides links to

many sources of data for spectra, physical properties and biological data including

include the NIST WebBook, the National Cancer Institute, DrugBank and PubChem.

       eMolecules is presently fairly limited in its scope to offering a path to the

purchase of chemicals and links to the more popular government databases.

Nevertheless, the site is popular with chemists who are searching for chemicals and

the interface is intuitive and easy to use, a key element in attracting users.



Wikipedia

       For this author, Wikipedia certainly represents an important shift in the future

access of information associated with small molecules. A wiki is a type of computer

software that allows users easily to create, edit and link web pages. A wiki enables

documents to be written collaboratively, in a simple markup language using a web

browser, and is essentially a database for creating, browsing and searching

information. There are now thousands of pages describing small organic molecules,

inorganics, organometallics, polymers and even large biomolecules. Focusing on

small molecules, in general each one has a Drug Box or a Chemical infobox. The

drug box provides identifier information, in this case each identifier linking out to a

related resource, and chemical        data, pharmacokinetic     data and therapeutic

considerations. At present there are approximately 8000 articles with a chembox or

drugbox [3], with between 500-1000 articles added since May. The detailed

information offered on Wikipedia regarding a particular chemical or drug can be

excellent, see Figure 2, or weak.

       There been comments regarding the quality of Wikipedia in recent months.

Nevertheless, there are many dedicated supporters and contributors to the quality of

the online resource. The chemistry articles are not without issue and the drug and
Page 10 of 22

chemboxes specifically have been shown to contain errors. However, the advantage

of a wiki is that changes can be made within a few keystrokes and the quality is

immediately enhanced. This community curation process makes Wikipedia a very

important online chemistry resource, whose impact will only expand with time. Wikis,

and their usage for the purpose of communication in chemistry, will be discussed in

more detail in a separate article in this publication.



DrugBank

DrugBank [4] is a very valuable resource blending both bioinformatics and

cheminformatics data and combines detailed drug (i.e. chemical) data with

comprehensive drug target (i.e. protein) information. The Drugbank is hosted at the

University of Alberta, Canada and supported by Genome Alberta & Genome Canada,

a private, non-profit corporation whose mandate is to develop and implement a

national strategy in genomics and proteomics research.           At the time of release in

2006, the database contained >4100 drug entries including >800 FDA-approved

small   molecule and biotech       drugs, as    well     as   >3200   experimental   drugs.

Additionally, >14,000 protein or drug target sequences are linked to these drug

entries. Each DrugCard entry contains >80 data fields, with half of the information

being devoted to drug/chemical data and the other half devoted to drug target or

protein data. Many data fields are linked to other databases (KEGG, PubChem,

ChEBI, PDB and others). The database is fully-searchable, supporting extensive text,

sequence, chemical structure and relational query searches.

        DrugBank [4] has been used to facilitate in silico drug target discovery, drug

design, drug docking or screening, drug metabolism prediction, drug interaction

prediction and general pharmaceutical education. The version 2.0 release of

DrugBank [5] will add approximately 800 new drug entries and each DrugCard entry

will contain over 100 data fields with half of the information being devoted to
Page 11 of 22

drug/chemical    data    and   the    other   half   devoted     to   pharmacological,

pharmacogenomic and molecular biological data.

       DrugBank has committed to a semi-annual updating schedule to allow

information on newly approved and newly withdrawn drugs to be kept current. The

intention is to acquire experimental spectral data (NMR and MS specifically), and to

expand the coverage of nutraceuticals and herbal medicines.


Human Metabolome Database

The Human Metabolome Database (HMDB) [6] contains detailed information about

small molecule metabolites found in the human body and is used by scientists

working in the areas of metabolomics, clinical chemistry and biomarker discovery.

The database is developed by the same group responsible for DrugBank and, in

many ways, parallels the capabilities. The database links chemical data, clinical data,

and biochemistry data. The database currently contains nearly 3000 metabolite

entries and each MetaboCard entry contains more than 90 data fields devoted to

chemical, clinical data, enzymatic and biochemical data. Many data fields are

hyperlinked to other databases as with DrugBank and the database supports

extensive text, sequence, chemical structure and relational query searches.

       HMDB certainly offers a valuable resource to scientists investigating the

metabolome. The data are generally of high quality but, unfortunately, has inherited

some of the quality issues from PubChem regarding identifiers associated with

chemical structures (see page 17 of this online presentation).


ZINC

Zinc is a free database of commercially available compounds for virtual screening.

The library contains over 4.6 million molecules, each with a 3D structure and

gathered from the catalogs of compounds from vendors. All molecules in the

databases are assigned biologically-relevant protonation states and annotated with
Page 12 of 22

molecular properties. The database is available for free download in several common

file formats and a web-based search page, including a molecular drawing interface,

allows the database to be searched. This database facilitates the delivery of virtual

screening libraries to the community and continues to grow on an annual basis as

new chemicals are supplied by vendors.



SureChem


SureChem provides chemically-intelligent search of a patent database containing

over eight million US, European and World Patents. Using extraction heuristics to

identify chemical and trade names and conversion of the extracted entities to

chemical structures using a series of name to structure conversion tools, SureChem

has delivered a database integrated to nearly 9 million individual chemical structures

(see Figure 3 for an example search result on Gefitinib). The free access online portal

allows scientists to search the system based on structure, substructure or similarity

of structure, as well as the text-based searching expected for patent inquiries.

SureChem provide a 24 hour turnaround on the deposition of new patent data as

they are issued, thus providing essentially immediate access to structure-based

searching of patent space.


       The SureChem database contains a large number of fragment structures as

well as normal structures and this is to be expected due to the nature of patents.

Search options allow exclusion of these fragments prior to searching. While name-to-

structure conversion is not perfect and there will be some errors as a result of the

process, the SureChem team have already indicated their intention to enable real-

time curation of the data by users to facilitate annotation of the data and an ongoing

community improvement process as the system is used. The SureChem structure
Page 13 of 22

database has also been published to ChemSpider (vide infra) facilitating searches

from multiple systems.



ChemSpider and ChemRefer

       ChemSpider was released, by this author and associated team, to the public

in March 2007 with the lofty goal of “building a structure centric community for

chemists”. The database is focused on cheminformatics rather than bioinformatics

and manages data and knowledge regarding “small molecules” rather than proteins,

enzymes etc. Originally released with the PubChem dataset as the foundation

dataset, ChemSpider has grown into a resource containing almost 18 million unique

chemical structures and recently provided about 7 million unique compounds back to

the PubChem repository. The data sources have been gathered from chemical

vendors as well as commercial database vendors and publishers and members of the

Open Notebook Science community. ChemSpider has also integrated the SureChem

patent database collection of structures to facilitate links between the systems. The

database can be queried using structure/substructure searching and alphanumeric

text searching of both intrinsic, as well as predicted, molecular properties.

       ChemSpider has enabled capabilities not available to users of any of the other

systems. These include real time curation of the data, association of analytical data

with chemical structures, real-time deposition of single or batch chemical structures

(including with activity data) and transaction-based predictions of physicochemical

data. The ChemSpider developers have made available a series of web services to

allow integration to the system for the purpose of searching the system as well as

generation of InChI identifiers and conversion routines.

       The system integrates text-based searching of Open Access articles. This

capability, known as ChemRefer, presently searches over 50,000 OA Chemistry

articles. This should be compared with the much larger offering available within
Page 14 of 22

PubMed that includes over 17 million citations from MEDLINE and other life science

journals for biomedical articles back to the 1950s. The ChemRefer index is estimated

to grow to a quarter of a million articles in 2008. While MEDLINE has a concentration

on biomedicine ChemRefer will also abstract and index general chemistry articles,

wikis and blogs. Efforts are presently underway to extract chemical names from the

OA articles and convert these names to chemical structures using name to structure

conversion algorithms. These chemical structures will be deposited back to the

ChemSpider database thereby facilitating structure and substructure searching in

concert with text-based searching.

       ChemSpider has a focus on, and commitment to, community curation. The

social community aspects of the system, while still in their infancy, demonstrate the

potential of this approach. Real-time and rules-based curation of the data has

resulted in the removal of well over 150,000 incorrect identifiers. Over 100 spectra

have been added in just the few weeks since announcing the release of the capability

(made up of NMR, IR and Raman spectra). As proponents of the InChI identifiers,

progress has already been made by the ChemSpider team to collect RSS feeds and

publish chemical structures and links back to the original resources. With an advisory

group membership built up of medicinal chemists, cheminformatics experts,

members of the Open Source and commercial software communities and participants

from academia and the private sector, ChemSpider is unbiased in its efforts and

holds great promise for achieving their lofty goal of developing a structure centric

community for chemists. The team has recently announced the introduction of a

wiki-like environment for annotation of the chemical structures in the database, a

project they term WiChempedia, and will be utilizing both available Wikipedia content

and deposited content from users to enable the ongoing development of community

curated chemistry.
Page 15 of 22

       Since the author of this article is the host of the ChemSpider database I feel it

appropriate to comment on the technologies supporting and the motivation to

produce such a platform. The system is built primarily on commercial software using

a Microsoft technology platform as a result of ease of implementation, longevity and

and available skill sets. The motivation behind the platform was one of providing an

environment for collaboration and ongoing community contribution for the expansion

of the data and knowledge supported by the database. While Wikipedia is likely one

of the standards in terms of community participation for curation and authorship the

platform is not chemically aware in an informatics sense with the absence of

mundane capabilities such as structure and substructure searching. At the other

extreme PubChem is advanced in an informatics sense but is non-curated and offers

little opportunity for a chemist to do more than use the system. ChemSpider hopes

to be a gathering place for chemists to utilize an advanced cheminformatics platform

for adding, curating and enhancing structure-based information. The recent

extensions to the platform to perform virtual screening using the LASSO descriptors,

the intention to provide online tools to assist in analytical data analysis and the

ability to utilize text-mining based tools to interrogate the patent literature and

chemistry literature offer significant potential benefits to the drug discovery

community. Near term development will provide deeper integration to the biological

data supported on both Drugbank and PubChem.



Other Databases

       The list of databases and resources reviewed above is representative of the

type of information available online. Other highly regarded databases frequented by

this author include The Distributed Structure-Searchable Toxicity (DSSTox) Public

Database Network hosted by the US Environmental Protection Agency, KEGG and

CheBI. There are also many other resources available and the reader is referred to
Page 16 of 22

one of the many indexes of such databases available on the internet to identify

potential resources of interest.

       An increasing number of public databases will continue to become available,

but the challenge, even now, is how to integrate and access the data. In a recent

article Willighagen et al. [7] introduce userscripts to enrich biology and chemistry

related web resources by incorporating or linking to other computational or data

sources on the web. They showed how information from web pages can be used to

link to, search, and process information in other resources, thereby allowing

scientists to select and incorporate the appropriate web resources to enhance their

productivity. Such tools connecting open chemistry databases and user web pages is

an ideal path to more highly-integrated information sharing.

       The reader is pointed to other recent articles reviewing the availability of free

web tools and services supporting medicinal chemistry [8] and structure-based

virtual ligand-screening experiments [9]. These articles offer an excellent overview

of the increasingly rich offerings available to scientists. In regards to publicly

accessible databases a recent review article by Scior et al. discusses how large

compound databases can be used to facilitate the structure-activity relationships

studies in drug discovery [10].




Will Access to Free Resources Challenge the Commercial Sector?

       A number of organizations generate sizeable revenues from the creation of

chemistry databases for the life sciences industry. The annual fees for accessing this

information likely exceeds half a billion dollars. The Chemical Abstracts Service alone

generates revenue in excess of $250 million dollars. The primary advantage of

commercial databases is that they have been manually examined by skilled curators,
Page 17 of 22

addressing the tedious task of quality data-checking. Certainly, the aggregation of

data from multiple sources, both historical and modern, from multiple countries and

languages    and    from   sources    not   available   electronically,   are   significant

enhancements over what is available via an internet search. The question remains

how long will this remain an issue?

       Chemical Abstracts Services’ and other commercial tools and databases will

continue to deliver significant benefit to its users. However, the world as a whole are

likely to reap increasing benefits from the growing number of free access services

and content databases. While commercial vendors generally have a highly

moderated release cycle of new functionality and capabilities online services tend to

move at a faster pace adding new capabilities, resolving issues and adding fresh

technologies on a rolling basis. This type of drive both excites present users and

draws new users to the expanding offerings. Academics in particular are likely to

have an increased focus on the use of free access databases and tools as it is

demonstrative of the new found freedoms of information. The benefit for them of

course is reduced expenditures for the commercial offerings. This is further

exaggerated in third world countries where free access systems are the primary

resources for information since commercial offerings are simply out of reach due to

price barriers.

       Modern scientists are primarily concerned with what has happened in the

more immediate history specifically for new areas of science. Interest in antique

historical records rarely exists. In many ways, what is not available online is likely

not of interest to more and more researchers, whether this be appropriate or not.

Increasingly, librarians at universities and large pharma are giving away their print

collections in favor of electronic repositories of chemical journals. Despite some

views that search engine models may be parasitic in their nature, internet search

engines are increasingly likely to be the first port of call for the majority of scientists
Page 18 of 22

for two simple reasons – they are fast and they are free. As has already been

discussed, chemically-searchable patents are now available online, at no charge. The

present database, while containing errors, still provides value to its users and offers

access to information not available in other curated patent systems, specifically

access to prophetic compounds and 24 hour turnaround on the updating of patent

data after they are issued. Such immediate access to new data, the potential of the

web to provide alerts based on particular structure class and the interest of the

modern scientist in more recent information, offers a significant business advantage

to such offerings. In terms of data quality issues, the internet generation has already

demonstrated a willingness to curate, modify and enhance the quality of content as

modeled by Wikipedia. With the appropriate enhancements in place, online curation

and markup of the data in real time can quickly address errors in the data as has

already been demonstrated by the ChemSpider system.

       With the improvements promised by the semantic web, if there are data of

interest to be found, the search engines will facilitate it. As the President of CAS,

Robert Massie, commented: "Chemical Abstracts has to be better than Google, better

than our competitors, so that we can charge a premium price." It is this author’s

judgment that this will become increasingly difficult as science and technology

progress and the redesign of commercial database business models becomes more

necessary.



Conclusion

       Increasing access to free and open access databases of both chemistry and

biological data is certainly impacting the manner by which scientists access

information. These databases are additional tendrils in the web of internet resources

that continue to expand in their proliferation of freely-accessible data and

information, such as patents, open and free access peer-reviewed publications and
Page 19 of 22

software tools for the manipulation of chemistry related data. As data-mining tools

expand in their capabilities and performance, the chemistry databases available

online now, and in the future, are likely to offer even greater opportunities to benefit

the process of discovery. As these databases grow in both their content and quality,

there may be challenging times ahead in regards to the commercial business models

of publishers versus the drive towards more freely available data. This author

remains hopeful that the journey ahead can bring benefit to all parties.

       It should be noted that following the initial preparation of this article President

Bush signed a bill requiring the National Institutes of Health to mandate Open Access

for NIH-funded research. This act is likely to lead to a catalytic growth in Open

Access exposure.



About the Author

Antony Williams is the Host of ChemSpider. He has spent over a decade in the

commercial scientific software business as Chief Science Officer for Advanced

Chemistry Development and is an NMR spectroscopist by training with over 100

peer-reviewed publications. He has taken his passion for providing access to

chemistry-related information and software services to the masses and is now

applying his time to hosting ChemSpider, working alongside the intellect and

innovation making up its development team and immersing himself in the experience

of blogging. He can be contacted at antony.williams@chemspider.com.



Acknowledgements

The author would like to acknowledge the active participation of many Open Data,

Open Source, Open Standards (ODOSOS) contributors for participating in multi-

forum discussions regarding the changing environment of Open and Free Access. I

also extend my thanks to the many passionate individuals and teams who have
Page 20 of 22

contributed to the databases and systems reviewed herein. Your efforts are

facilitating better science and deserve acknowledgment.



References

[1] Zhou, Y. ,et al. (2007) Large-Scale Annotation of Small-Molecule Libraries Using

Public Databases. J. Chem. Inf. Model. 47, 1386-1394

[2] Baker, M. (2006) Open-access chemistry databases evolving slowly but not

surely, Nature Reviews, Drug Discovery, 5, 707-708

[3] Martin Walker, member of Wikipedia:Chemistry (private communication)

[4] Wishart, D.S. et al. (2006) DrugBank: a comprehensive resource for in silico

drug discovery and exploration, Nucleic Acids Res. 34, D668-72

[5] Wishart, D.S. et al. (2007) DrugBank: a knowledgebase for drugs, drug actions

and drug targets, Nucleic Acids Res. (in press) Nucleic Acids Research Advance

Access published online on November 29, 2007

(http://nar.oxfordjournals.org/cgi/content/abstract/gkm958v1)

[6] Wishart, D.S. et al. (2007) HMDB: The Human Metabolome Database. Nucleic

Acids Res. 35, D521-6

[7] Willighagen, E. L. et al. (2007) Userscripts for the Life Sciences, BMC

Bioinformatics, 8, 487 (in press). Published online on December, 2007.

http://www.biomedcentral.com/1471-2105/8/487

[8] Ertl, P. et al. (2007) Designing drugs on the internet? Free web tools and

services supporting medicinal chemistry, Curr. Top. Med. Chem. 7, 1491-501

[9] Villoutreix B.O. et al.(2007) Free resources to assist structure-based virtual

ligand screening experiments, Curr. Protein Pept. Sci. 8, 381-411

[10] Scior, T. et al. (2007); Large compound databases for structure-activity

relationships studies in drug discovery. Mini-Reviews in Med .l Chem. 7, 851-860.
Page 21 of 22

Figures




Figure 1 - The Compound Summary Page for Gefitinib in PubChem. Page 1 only is

shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=123631)
Page 22 of 22




Figure 2: The DrugBox for Gefitinib from Wikipedia

(http://en.wikipedia.org/wiki/Gefitinib)




Figure 3. An example patent results page for SureChem (www.surechem.org), a

chemically intelligent, free access online patent search. A search on Gefitinib

provided an abundance of Patent information to examine.

Mais conteúdo relacionado

Mais procurados

Webometrics Unimas
Webometrics UnimasWebometrics Unimas
Webometrics UnimasHiram Ting
 
Frederick Friend: Where we are now in opening research results and data
Frederick Friend: Where we are now in opening research results and data Frederick Friend: Where we are now in opening research results and data
Frederick Friend: Where we are now in opening research results and data NeilStewartCity
 
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...Libraries and Research Data Curation: Barriers and Incentives for Preservatio...
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...University of California Curation Center
 
Integration of research literature and data (InFoLiS)
Integration of research literature and data (InFoLiS)Integration of research literature and data (InFoLiS)
Integration of research literature and data (InFoLiS)Philipp Zumstein
 
Supporting UC Research Data Management
Supporting UC Research Data ManagementSupporting UC Research Data Management
Supporting UC Research Data Managementslabrams
 
Understanding the Big Data Enterprise
Understanding the Big Data EnterpriseUnderstanding the Big Data Enterprise
Understanding the Big Data EnterprisePhilip Bourne
 
Moving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisMoving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisPhilip Bourne
 
There is No Intelligent Life Down Here
There is No Intelligent Life Down HereThere is No Intelligent Life Down Here
There is No Intelligent Life Down HerePhilip Bourne
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metricsPaul Groth
 
If Data Are The New Oil, How Do We Prevent Global Warming?
If Data Are The New Oil, How Do We Prevent Global Warming?If Data Are The New Oil, How Do We Prevent Global Warming?
If Data Are The New Oil, How Do We Prevent Global Warming?Philip Bourne
 
#ALAAC15 Linked Data Love
#ALAAC15 Linked Data Love #ALAAC15 Linked Data Love
#ALAAC15 Linked Data Love Kristi Holmes
 
Presentation the rationale, role and working methods of oa
Presentation the rationale, role and working methods of oaPresentation the rationale, role and working methods of oa
Presentation the rationale, role and working methods of oaCaroline Sutton
 
Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Philip Bourne
 

Mais procurados (20)

Webometrics Unimas
Webometrics UnimasWebometrics Unimas
Webometrics Unimas
 
Frederick Friend: Where we are now in opening research results and data
Frederick Friend: Where we are now in opening research results and data Frederick Friend: Where we are now in opening research results and data
Frederick Friend: Where we are now in opening research results and data
 
data.ac.uk briefing paper
data.ac.uk briefing paperdata.ac.uk briefing paper
data.ac.uk briefing paper
 
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...Libraries and Research Data Curation: Barriers and Incentives for Preservatio...
Libraries and Research Data Curation: Barriers and Incentives for Preservatio...
 
Integration of research literature and data (InFoLiS)
Integration of research literature and data (InFoLiS)Integration of research literature and data (InFoLiS)
Integration of research literature and data (InFoLiS)
 
TIDSR
TIDSRTIDSR
TIDSR
 
Supporting UC Research Data Management
Supporting UC Research Data ManagementSupporting UC Research Data Management
Supporting UC Research Data Management
 
Webometrics report
Webometrics reportWebometrics report
Webometrics report
 
Iad talk
Iad talkIad talk
Iad talk
 
Understanding the Big Data Enterprise
Understanding the Big Data EnterpriseUnderstanding the Big Data Enterprise
Understanding the Big Data Enterprise
 
Moving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisMoving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT Analysis
 
There is No Intelligent Life Down Here
There is No Intelligent Life Down HereThere is No Intelligent Life Down Here
There is No Intelligent Life Down Here
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metrics
 
Cartegena051811
Cartegena051811Cartegena051811
Cartegena051811
 
If Data Are The New Oil, How Do We Prevent Global Warming?
If Data Are The New Oil, How Do We Prevent Global Warming?If Data Are The New Oil, How Do We Prevent Global Warming?
If Data Are The New Oil, How Do We Prevent Global Warming?
 
#ALAAC15 Linked Data Love
#ALAAC15 Linked Data Love #ALAAC15 Linked Data Love
#ALAAC15 Linked Data Love
 
Presentation the rationale, role and working methods of oa
Presentation the rationale, role and working methods of oaPresentation the rationale, role and working methods of oa
Presentation the rationale, role and working methods of oa
 
Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Data Science BD2K Update for NIH
Data Science BD2K Update for NIH
 
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-researchUc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
 
Searching the web general
Searching the web generalSearching the web general
Searching the web general
 

Semelhante a A perspective of Publicly Accessible/Open Access Chemistry Databases

Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchGigaScience, BGI Hong Kong
 
Sustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research DataSustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research Datagideon christian
 
The Traditional Licensing Model
The Traditional Licensing Model The Traditional Licensing Model
The Traditional Licensing Model blamb
 
Calhoun Data Sharing Panel IFLA Aug 2008
Calhoun Data Sharing Panel IFLA  Aug 2008Calhoun Data Sharing Panel IFLA  Aug 2008
Calhoun Data Sharing Panel IFLA Aug 2008Karen S Calhoun
 
Open PHACTS MIOSS may 2016
Open PHACTS MIOSS may 2016Open PHACTS MIOSS may 2016
Open PHACTS MIOSS may 2016open_phacts
 
1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-mainGraham Steel
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
Collaboration - theory & Practice
Collaboration - theory & PracticeCollaboration - theory & Practice
Collaboration - theory & PracticeSean Ekins
 

Semelhante a A perspective of Publicly Accessible/Open Access Chemistry Databases (20)

Precompetitive preclinical ADME/tox data and set it free on the web to facili...
Precompetitive preclinical ADME/tox data and set it free on the web to facili...Precompetitive preclinical ADME/tox data and set it free on the web to facili...
Precompetitive preclinical ADME/tox data and set it free on the web to facili...
 
Reaching out to collaborators and crowdsourcing for pharmaceutical research
Reaching out to collaborators and crowdsourcing for pharmaceutical research  Reaching out to collaborators and crowdsourcing for pharmaceutical research
Reaching out to collaborators and crowdsourcing for pharmaceutical research
 
Open Notebook Science and One Future for Scientific Research
Open Notebook Science and One Future for Scientific ResearchOpen Notebook Science and One Future for Scientific Research
Open Notebook Science and One Future for Scientific Research
 
Qualifying Online Information Resources for Chemists
Qualifying Online Information Resources for ChemistsQualifying Online Information Resources for Chemists
Qualifying Online Information Resources for Chemists
 
When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...
 
Why open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and modelsWhy open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and models
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do research
 
ChemSpider as a hub for online chemical information resources
ChemSpider as a hub for online chemical information resources   ChemSpider as a hub for online chemical information resources
ChemSpider as a hub for online chemical information resources
 
Sustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research DataSustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research Data
 
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
 
RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn
RSC ChemSpider Science Commons Symposium Pacific Northwest #scspnRSC ChemSpider Science Commons Symposium Pacific Northwest #scspn
RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn
 
The Traditional Licensing Model
The Traditional Licensing Model The Traditional Licensing Model
The Traditional Licensing Model
 
Online Resources to Support Open Drug Discovery Systems
Online Resources to Support Open Drug Discovery SystemsOnline Resources to Support Open Drug Discovery Systems
Online Resources to Support Open Drug Discovery Systems
 
Calhoun Data Sharing Panel IFLA Aug 2008
Calhoun Data Sharing Panel IFLA  Aug 2008Calhoun Data Sharing Panel IFLA  Aug 2008
Calhoun Data Sharing Panel IFLA Aug 2008
 
Open PHACTS MIOSS may 2016
Open PHACTS MIOSS may 2016Open PHACTS MIOSS may 2016
Open PHACTS MIOSS may 2016
 
1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
Collaboration - theory & Practice
Collaboration - theory & PracticeCollaboration - theory & Practice
Collaboration - theory & Practice
 
Hjoseph Ticer09
Hjoseph Ticer09Hjoseph Ticer09
Hjoseph Ticer09
 
Whitney Symposium Lecture June 2008
Whitney Symposium Lecture June 2008Whitney Symposium Lecture June 2008
Whitney Symposium Lecture June 2008
 

Último

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Último (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

A perspective of Publicly Accessible/Open Access Chemistry Databases

  • 1. Page 1 of 22 A perspective of Publicly Accessible/Open Access Chemistry Databases Antony J. Williams ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587 Phone: 919-341-8375; FAX: 919-300-5321 Corresponding Author:antony.williams@chemspider.com Keywords: Open Access Chemistry, Online databases, Internet chemistry Teaser Sentence: Online chemistry resources are facilitating access to information at an increasingly rapid pace offering better and faster decision making and speed up the process of discovery. Abstract The internet has spawned access to unprecedented levels of information. For chemists, the increasing number of resources they can utilize to access chemistry- related information provides them a valuable path to discovery of information, which was previously limited to commercial and, therefore, constrained resources. The diversity of information continues to expand dramatically and, coupled with an increasing awareness for quality, curation and improved tools for focused searches, chemists can now find valuable information within a few seconds using a few keystrokes. Shifting to publicly-available resources offers great promise to the benefits of science and society yet brings with it increasing concern from commercial entities. This article discusses the benefits and disruptions associated with an increase in publicly available scientific resources.
  • 2. Page 2 of 22 Ask a chemist how often they use the internet to search for science-related information online and it would be fair to expect the answer would be daily. Certainly the web now dominates as the primary portal to query for general information and data. Yet, despite the tremendous growth in scientific internet resources, the potential for providing useful chemistry information and data have only recently started to be tapped. Bioinformatics has led the charge in providing online access to data far ahead of the efforts in Chemistry. Open-access databases like GenBank and the Protein Data Bank have been assisting biologists to translate gene and protein sequences into biological relevance for well over two decades. Some of the responsibility for the differences in efforts has commonly been put onto the shoulders of publishers in Chemistry, whether for scientific articles or commercial chemistry databases. Nevertheless, societal thrust, evangelists and group efforts are forcing both free and open access (vide infra) to chemistry-related information. There are many indexes of chemistry databases online and this article is not intended to be yet another. Rather, this article will review both the availability and capabilities of online chemistry databases, particularly those offering free or open access. The progress in the availability of freely-accessible information is highly enabling and advantageous for the advancement of science and our future well- being, but likely seen as a disruptive force for commercial bodies. This shift is needed, not only to facilitate improved access to information in academia and government laboratories, but also in commercial organizations feeling the pressure of poor performance in the discovery and development processes. The pharmaceutical companies in particular should welcome improved access to chemistry-related information as their business dominance is taken to task with drugs coming off patent and no replacement blockbuster drugs in the pipeline.
  • 3. Page 3 of 22 In keeping with the web-based nature of this article, the majority of references will be to internet resources. At the time of submission all references were active but, as is the nature of the internet, these resources will age and may disappear. What is Open versus Free Access? There is much confusion around the differences between Open Access (OA) versus Free Access (FA). This is also accompanied by corporate protectionism and political battles which have found their way to Capitol Hill. Both OA and FA offer a great opportunity to the advancement of science by sharing data, information and knowledge as it is created. With these in place, publishing houses and institutional repositories of information, specifically structure databases with related information, are threatened by the potential impact on their business model and associated revenues. The first major international statement on open access was the Budapest Open Access Initiative (BOAI), in February 2002 (FAQs online). The definition of Open Access from the BOAI frequently asked questions website is as follows: “By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” With this definition in mind, one can understand the potential impact to publishing business models. There are other concerns in the OA model but these are reviewed elsewhere.
  • 4. Page 4 of 22 Free Access is not equivalent to OA, but, by definition, OA assumes Free Access as part of its model. This author could not find an equivalent definition for Free Access online but suggests the following: Free access is access that removes price barriers but not necessarily any permission barriers. Many organizations provide free access to their publications, content and data but the rules under which the information is made available can differ between the hosts. Two specific examples are cited: 1) The Royal Society of Chemistry offers access to thousands of articles via its free access policies. Copyright remains with the society. 2) The SureChem patent portal (vide infra) provides chemical structure- searchable access to patents online, free of charge. They do, however, sell their content to organizations, provide secure portal access and data export for organizations for a fee and have a business model enabling both free access and for profit activities. To most people, free access is not inferior to open access – just different. Most scientists are overjoyed to have free access to information and data previously not available and will willingly use such resources. Political decision makers are now directly engaged in reviewing the benefits of OA (and even FA) to science and society and limiting Open Access. In November of 2007, President Bush vetoed a bill that aimed to make all National Institutes of Health (NIH)-funded research publications freely available on the web and the furore that arose from the appearance of the PubChem database as competition to other repositories (specifically the Chemical Abstracts Service) has caused a significant rift between members of the ACS and the society’s leadership. While these challenges are yet to be fully navigated, it is clear that publishers will need to modify their business models to address the drive towards more FA and OA resources.
  • 5. Page 5 of 22 Meanwhile, groups such as the Public Library of Science and Chemistry Central, discussed recently, are leading the charge for more Open Access. For the purpose of this article, both Free Access and Open Access will be defined as: no barriers to using the system in order to derive value; no forced registration or logins in order to use the system. This does not necessarily mean that there might not also be fee-based services associated with the resource - a site can have free and fee-based services on offer simultaneously. Free and Open Access Online Chemistry Databases Depending on the definition of a database, there are many freely available on the web. It is not uncommon to hear chemists comment about a downloadable database of structures. This commonly refers to a file containing a number of structures, commonly tens to tens of thousands in the Structure Data Format (SDF). These files are generally imported to a database for viewing. There are many hundreds of SDF files available online and, based on experience, a week of downloading, importing and de-duplication could easily provide at least 15 million unique structures. As this article goes to press the PubChem dataset now numbers 18 million unique structures and can be downloaded as a single data source. These SDF files are commonly provided by chemical vendors for the purpose of facilitating commercial sales. Such files can contain structure identifiers (names and numbers), experimental or physical properties, file specific identifiers and, commonly, pricing information. Since they are assembled in a heterogeneous manner such data are plagued with inconsistencies and data quality issues. There are, however, a number of online database resources offering access to valuable data and knowledge. Some of these could be thought of as “linkbases”, a term which, for the sake of this article, can be considered as a repository of molecular connection tables linking out to multiple sources of data and associated
  • 6. Page 6 of 22 information. While this review cannot be exhaustive, we will examine a number of these free resources and the value they are starting to deliver to chemists. PubChem The highest profile online database is certainly PubChem. NIH launched the database in 2004 as part of a suite of databases supporting the New Pathways to Discovery component of their roadmap initiative. PubChem archives and organizes information about the biological activities of chemical compounds into a comprehensive biomedical database to support the Molecular Libraries initiative component of the roadmap. PubChem is the informatics backbone for the initiative and is intended to empower the scientific community to use low molecular weight chemical compounds in their research. PubChem consists of three databases (PubChem Compound, PubChem Substance, and PubChem Bio-Assay) connected together and incorporated into the Entrez information system of the National Center for Biotechnology Information (NCBI). PubChem Compound contains 18 million unique structures and provides biological property information for each compound through links to other Entrez databases (see Figure 1 for an example). PubChem Substance contains records of substances from depositors into the system. These are publishers, chemical vendors, commercial databases and other sources. It provides descriptions of chemicals and links to PubMed, protein 3D structures, and screening results. The PubChem Compound database contains records of individual compounds. PubChem BioAssay contains information about bioassays using specific terms pertinent to the bioassay. PubChem can be searched by alphanumeric text variables, such as names of chemicals, property ranges or by structure, substructure or structural similarity. As of December 2007 its content is approaching 38.7 million substances and 18.4 million unique structures.
  • 7. Page 7 of 22 The value of PubChem has been lauded by a number of advocates, especially those evangelizing open access and open data initiatives. Certainly such a source of data opens up new possibilities in regards to data mining and extraction. Zhou et al [1] concluded that the system has an important role in as a central repository for chemical vendors and content providers. This enables evaluation of commercial compound libraries and saves biomedical researchers from the work associated with gathering and searching commercial databases. They concluded that the hit-to-lead decision making process in drug discovery programs can certainly benefit from the ongoing annotation service provided by PubChem. They identified that over 35% of the 5 million structures from chemical vendors or screening centers found in the PubChem database currently are not present in the CAS database and suggested that parties such as CAS could benefit from inclusion of the open-access chemical and biological data found in PubChem into their own repositories. PubChem continues to grow in stature, content and capability. The number of both substances and unique chemical structures continues to grow on an annual basis. The bioassay data resulting from the NIH Roadmap initiative is likely to continue to grow and PubChem will assume a prominent role in distributing the data in a standard format. The integration to PubMed offers a necessary connection between the scientific literature and associated chemical structures. PubMed includes over 17 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubChem and PubMed are intimately linked as discussed by Zhou et al. [1]. At the time of their study there were over 82,000 PubMed papers and 1.6 million PubMed Central papers linked to PubChem. Despite the obvious value of PubChem, the platform has not been without its detractors. There have been numerous heated debates regarding the position of CAS relative to the resource and the reader is referred elsewhere for opinions and commentaries. Others have commented on their concerns regarding the quality of
  • 8. Page 8 of 22 the data content within PubChem. Shoichet [2], has commented that the screening data are less rigorous than those in peer-reviewed articles, and contain many false positives. Deposited data are not curated, and so mistakes in structures, units and other characteristics can, and do, occur. Shoichet worries that chemists who use PubChem may be sent on a wild goose chase. Williams and others have pointed to the accuracy of some of the identifiers associated with the PubChem compounds. The problems arise from the quality of submissions from the various data sources. There are thousands of errors in the structure-identifier associations due to this contamination and this can lead to the retrieval of incorrect chemical structures. It is also common to have multiple representations of a single structure, due to incomplete or total lack of stereochemistry for a molecule. PubChem is not resourced to resolve these issues and these unfortunate errors have generally arisen due to chemical vendors seeing PubChem as a potential path to commercial gain, rather than its primary focus of supporting the Molecular Libraries initiative. eMolecules eMolecules, originally released as Chmoogle prior to legal wranglings with a famous search engine provider is a commercial entity that offers a free online database of over 7 million unique chemical structures derived from more than 150 suppliers and offers scientists and procurement staff a path to identifying a vendor for a particular chemical compound. By providing access to compounds for purchase they are providing a free access online service similar to those of commercial databases such as Symyx Available Chemical Directory, CAS’s ChemCats and Cambridgesoft’s ChemACX, to name just three. They also offer a service to chemical suppliers to customize a search page for the vendor and request online quotes. The system offers access to more than 4 million commercially available screening compounds and hundreds of thousands of building blocks and
  • 9. Page 9 of 22 intermediates and access via ChemGate to NMR, MS and IR spectra from Wiley-VCH for over 500,000 compounds, a fee-based service. eMolecules also provides links to many sources of data for spectra, physical properties and biological data including include the NIST WebBook, the National Cancer Institute, DrugBank and PubChem. eMolecules is presently fairly limited in its scope to offering a path to the purchase of chemicals and links to the more popular government databases. Nevertheless, the site is popular with chemists who are searching for chemicals and the interface is intuitive and easy to use, a key element in attracting users. Wikipedia For this author, Wikipedia certainly represents an important shift in the future access of information associated with small molecules. A wiki is a type of computer software that allows users easily to create, edit and link web pages. A wiki enables documents to be written collaboratively, in a simple markup language using a web browser, and is essentially a database for creating, browsing and searching information. There are now thousands of pages describing small organic molecules, inorganics, organometallics, polymers and even large biomolecules. Focusing on small molecules, in general each one has a Drug Box or a Chemical infobox. The drug box provides identifier information, in this case each identifier linking out to a related resource, and chemical data, pharmacokinetic data and therapeutic considerations. At present there are approximately 8000 articles with a chembox or drugbox [3], with between 500-1000 articles added since May. The detailed information offered on Wikipedia regarding a particular chemical or drug can be excellent, see Figure 2, or weak. There been comments regarding the quality of Wikipedia in recent months. Nevertheless, there are many dedicated supporters and contributors to the quality of the online resource. The chemistry articles are not without issue and the drug and
  • 10. Page 10 of 22 chemboxes specifically have been shown to contain errors. However, the advantage of a wiki is that changes can be made within a few keystrokes and the quality is immediately enhanced. This community curation process makes Wikipedia a very important online chemistry resource, whose impact will only expand with time. Wikis, and their usage for the purpose of communication in chemistry, will be discussed in more detail in a separate article in this publication. DrugBank DrugBank [4] is a very valuable resource blending both bioinformatics and cheminformatics data and combines detailed drug (i.e. chemical) data with comprehensive drug target (i.e. protein) information. The Drugbank is hosted at the University of Alberta, Canada and supported by Genome Alberta & Genome Canada, a private, non-profit corporation whose mandate is to develop and implement a national strategy in genomics and proteomics research. At the time of release in 2006, the database contained >4100 drug entries including >800 FDA-approved small molecule and biotech drugs, as well as >3200 experimental drugs. Additionally, >14,000 protein or drug target sequences are linked to these drug entries. Each DrugCard entry contains >80 data fields, with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. Many data fields are linked to other databases (KEGG, PubChem, ChEBI, PDB and others). The database is fully-searchable, supporting extensive text, sequence, chemical structure and relational query searches. DrugBank [4] has been used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The version 2.0 release of DrugBank [5] will add approximately 800 new drug entries and each DrugCard entry will contain over 100 data fields with half of the information being devoted to
  • 11. Page 11 of 22 drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. DrugBank has committed to a semi-annual updating schedule to allow information on newly approved and newly withdrawn drugs to be kept current. The intention is to acquire experimental spectral data (NMR and MS specifically), and to expand the coverage of nutraceuticals and herbal medicines. Human Metabolome Database The Human Metabolome Database (HMDB) [6] contains detailed information about small molecule metabolites found in the human body and is used by scientists working in the areas of metabolomics, clinical chemistry and biomarker discovery. The database is developed by the same group responsible for DrugBank and, in many ways, parallels the capabilities. The database links chemical data, clinical data, and biochemistry data. The database currently contains nearly 3000 metabolite entries and each MetaboCard entry contains more than 90 data fields devoted to chemical, clinical data, enzymatic and biochemical data. Many data fields are hyperlinked to other databases as with DrugBank and the database supports extensive text, sequence, chemical structure and relational query searches. HMDB certainly offers a valuable resource to scientists investigating the metabolome. The data are generally of high quality but, unfortunately, has inherited some of the quality issues from PubChem regarding identifiers associated with chemical structures (see page 17 of this online presentation). ZINC Zinc is a free database of commercially available compounds for virtual screening. The library contains over 4.6 million molecules, each with a 3D structure and gathered from the catalogs of compounds from vendors. All molecules in the databases are assigned biologically-relevant protonation states and annotated with
  • 12. Page 12 of 22 molecular properties. The database is available for free download in several common file formats and a web-based search page, including a molecular drawing interface, allows the database to be searched. This database facilitates the delivery of virtual screening libraries to the community and continues to grow on an annual basis as new chemicals are supplied by vendors. SureChem SureChem provides chemically-intelligent search of a patent database containing over eight million US, European and World Patents. Using extraction heuristics to identify chemical and trade names and conversion of the extracted entities to chemical structures using a series of name to structure conversion tools, SureChem has delivered a database integrated to nearly 9 million individual chemical structures (see Figure 3 for an example search result on Gefitinib). The free access online portal allows scientists to search the system based on structure, substructure or similarity of structure, as well as the text-based searching expected for patent inquiries. SureChem provide a 24 hour turnaround on the deposition of new patent data as they are issued, thus providing essentially immediate access to structure-based searching of patent space. The SureChem database contains a large number of fragment structures as well as normal structures and this is to be expected due to the nature of patents. Search options allow exclusion of these fragments prior to searching. While name-to- structure conversion is not perfect and there will be some errors as a result of the process, the SureChem team have already indicated their intention to enable real- time curation of the data by users to facilitate annotation of the data and an ongoing community improvement process as the system is used. The SureChem structure
  • 13. Page 13 of 22 database has also been published to ChemSpider (vide infra) facilitating searches from multiple systems. ChemSpider and ChemRefer ChemSpider was released, by this author and associated team, to the public in March 2007 with the lofty goal of “building a structure centric community for chemists”. The database is focused on cheminformatics rather than bioinformatics and manages data and knowledge regarding “small molecules” rather than proteins, enzymes etc. Originally released with the PubChem dataset as the foundation dataset, ChemSpider has grown into a resource containing almost 18 million unique chemical structures and recently provided about 7 million unique compounds back to the PubChem repository. The data sources have been gathered from chemical vendors as well as commercial database vendors and publishers and members of the Open Notebook Science community. ChemSpider has also integrated the SureChem patent database collection of structures to facilitate links between the systems. The database can be queried using structure/substructure searching and alphanumeric text searching of both intrinsic, as well as predicted, molecular properties. ChemSpider has enabled capabilities not available to users of any of the other systems. These include real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The ChemSpider developers have made available a series of web services to allow integration to the system for the purpose of searching the system as well as generation of InChI identifiers and conversion routines. The system integrates text-based searching of Open Access articles. This capability, known as ChemRefer, presently searches over 50,000 OA Chemistry articles. This should be compared with the much larger offering available within
  • 14. Page 14 of 22 PubMed that includes over 17 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. The ChemRefer index is estimated to grow to a quarter of a million articles in 2008. While MEDLINE has a concentration on biomedicine ChemRefer will also abstract and index general chemistry articles, wikis and blogs. Efforts are presently underway to extract chemical names from the OA articles and convert these names to chemical structures using name to structure conversion algorithms. These chemical structures will be deposited back to the ChemSpider database thereby facilitating structure and substructure searching in concert with text-based searching. ChemSpider has a focus on, and commitment to, community curation. The social community aspects of the system, while still in their infancy, demonstrate the potential of this approach. Real-time and rules-based curation of the data has resulted in the removal of well over 150,000 incorrect identifiers. Over 100 spectra have been added in just the few weeks since announcing the release of the capability (made up of NMR, IR and Raman spectra). As proponents of the InChI identifiers, progress has already been made by the ChemSpider team to collect RSS feeds and publish chemical structures and links back to the original resources. With an advisory group membership built up of medicinal chemists, cheminformatics experts, members of the Open Source and commercial software communities and participants from academia and the private sector, ChemSpider is unbiased in its efforts and holds great promise for achieving their lofty goal of developing a structure centric community for chemists. The team has recently announced the introduction of a wiki-like environment for annotation of the chemical structures in the database, a project they term WiChempedia, and will be utilizing both available Wikipedia content and deposited content from users to enable the ongoing development of community curated chemistry.
  • 15. Page 15 of 22 Since the author of this article is the host of the ChemSpider database I feel it appropriate to comment on the technologies supporting and the motivation to produce such a platform. The system is built primarily on commercial software using a Microsoft technology platform as a result of ease of implementation, longevity and and available skill sets. The motivation behind the platform was one of providing an environment for collaboration and ongoing community contribution for the expansion of the data and knowledge supported by the database. While Wikipedia is likely one of the standards in terms of community participation for curation and authorship the platform is not chemically aware in an informatics sense with the absence of mundane capabilities such as structure and substructure searching. At the other extreme PubChem is advanced in an informatics sense but is non-curated and offers little opportunity for a chemist to do more than use the system. ChemSpider hopes to be a gathering place for chemists to utilize an advanced cheminformatics platform for adding, curating and enhancing structure-based information. The recent extensions to the platform to perform virtual screening using the LASSO descriptors, the intention to provide online tools to assist in analytical data analysis and the ability to utilize text-mining based tools to interrogate the patent literature and chemistry literature offer significant potential benefits to the drug discovery community. Near term development will provide deeper integration to the biological data supported on both Drugbank and PubChem. Other Databases The list of databases and resources reviewed above is representative of the type of information available online. Other highly regarded databases frequented by this author include The Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network hosted by the US Environmental Protection Agency, KEGG and CheBI. There are also many other resources available and the reader is referred to
  • 16. Page 16 of 22 one of the many indexes of such databases available on the internet to identify potential resources of interest. An increasing number of public databases will continue to become available, but the challenge, even now, is how to integrate and access the data. In a recent article Willighagen et al. [7] introduce userscripts to enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. They showed how information from web pages can be used to link to, search, and process information in other resources, thereby allowing scientists to select and incorporate the appropriate web resources to enhance their productivity. Such tools connecting open chemistry databases and user web pages is an ideal path to more highly-integrated information sharing. The reader is pointed to other recent articles reviewing the availability of free web tools and services supporting medicinal chemistry [8] and structure-based virtual ligand-screening experiments [9]. These articles offer an excellent overview of the increasingly rich offerings available to scientists. In regards to publicly accessible databases a recent review article by Scior et al. discusses how large compound databases can be used to facilitate the structure-activity relationships studies in drug discovery [10]. Will Access to Free Resources Challenge the Commercial Sector? A number of organizations generate sizeable revenues from the creation of chemistry databases for the life sciences industry. The annual fees for accessing this information likely exceeds half a billion dollars. The Chemical Abstracts Service alone generates revenue in excess of $250 million dollars. The primary advantage of commercial databases is that they have been manually examined by skilled curators,
  • 17. Page 17 of 22 addressing the tedious task of quality data-checking. Certainly, the aggregation of data from multiple sources, both historical and modern, from multiple countries and languages and from sources not available electronically, are significant enhancements over what is available via an internet search. The question remains how long will this remain an issue? Chemical Abstracts Services’ and other commercial tools and databases will continue to deliver significant benefit to its users. However, the world as a whole are likely to reap increasing benefits from the growing number of free access services and content databases. While commercial vendors generally have a highly moderated release cycle of new functionality and capabilities online services tend to move at a faster pace adding new capabilities, resolving issues and adding fresh technologies on a rolling basis. This type of drive both excites present users and draws new users to the expanding offerings. Academics in particular are likely to have an increased focus on the use of free access databases and tools as it is demonstrative of the new found freedoms of information. The benefit for them of course is reduced expenditures for the commercial offerings. This is further exaggerated in third world countries where free access systems are the primary resources for information since commercial offerings are simply out of reach due to price barriers. Modern scientists are primarily concerned with what has happened in the more immediate history specifically for new areas of science. Interest in antique historical records rarely exists. In many ways, what is not available online is likely not of interest to more and more researchers, whether this be appropriate or not. Increasingly, librarians at universities and large pharma are giving away their print collections in favor of electronic repositories of chemical journals. Despite some views that search engine models may be parasitic in their nature, internet search engines are increasingly likely to be the first port of call for the majority of scientists
  • 18. Page 18 of 22 for two simple reasons – they are fast and they are free. As has already been discussed, chemically-searchable patents are now available online, at no charge. The present database, while containing errors, still provides value to its users and offers access to information not available in other curated patent systems, specifically access to prophetic compounds and 24 hour turnaround on the updating of patent data after they are issued. Such immediate access to new data, the potential of the web to provide alerts based on particular structure class and the interest of the modern scientist in more recent information, offers a significant business advantage to such offerings. In terms of data quality issues, the internet generation has already demonstrated a willingness to curate, modify and enhance the quality of content as modeled by Wikipedia. With the appropriate enhancements in place, online curation and markup of the data in real time can quickly address errors in the data as has already been demonstrated by the ChemSpider system. With the improvements promised by the semantic web, if there are data of interest to be found, the search engines will facilitate it. As the President of CAS, Robert Massie, commented: "Chemical Abstracts has to be better than Google, better than our competitors, so that we can charge a premium price." It is this author’s judgment that this will become increasingly difficult as science and technology progress and the redesign of commercial database business models becomes more necessary. Conclusion Increasing access to free and open access databases of both chemistry and biological data is certainly impacting the manner by which scientists access information. These databases are additional tendrils in the web of internet resources that continue to expand in their proliferation of freely-accessible data and information, such as patents, open and free access peer-reviewed publications and
  • 19. Page 19 of 22 software tools for the manipulation of chemistry related data. As data-mining tools expand in their capabilities and performance, the chemistry databases available online now, and in the future, are likely to offer even greater opportunities to benefit the process of discovery. As these databases grow in both their content and quality, there may be challenging times ahead in regards to the commercial business models of publishers versus the drive towards more freely available data. This author remains hopeful that the journey ahead can bring benefit to all parties. It should be noted that following the initial preparation of this article President Bush signed a bill requiring the National Institutes of Health to mandate Open Access for NIH-funded research. This act is likely to lead to a catalytic growth in Open Access exposure. About the Author Antony Williams is the Host of ChemSpider. He has spent over a decade in the commercial scientific software business as Chief Science Officer for Advanced Chemistry Development and is an NMR spectroscopist by training with over 100 peer-reviewed publications. He has taken his passion for providing access to chemistry-related information and software services to the masses and is now applying his time to hosting ChemSpider, working alongside the intellect and innovation making up its development team and immersing himself in the experience of blogging. He can be contacted at antony.williams@chemspider.com. Acknowledgements The author would like to acknowledge the active participation of many Open Data, Open Source, Open Standards (ODOSOS) contributors for participating in multi- forum discussions regarding the changing environment of Open and Free Access. I also extend my thanks to the many passionate individuals and teams who have
  • 20. Page 20 of 22 contributed to the databases and systems reviewed herein. Your efforts are facilitating better science and deserve acknowledgment. References [1] Zhou, Y. ,et al. (2007) Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf. Model. 47, 1386-1394 [2] Baker, M. (2006) Open-access chemistry databases evolving slowly but not surely, Nature Reviews, Drug Discovery, 5, 707-708 [3] Martin Walker, member of Wikipedia:Chemistry (private communication) [4] Wishart, D.S. et al. (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. 34, D668-72 [5] Wishart, D.S. et al. (2007) DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. (in press) Nucleic Acids Research Advance Access published online on November 29, 2007 (http://nar.oxfordjournals.org/cgi/content/abstract/gkm958v1) [6] Wishart, D.S. et al. (2007) HMDB: The Human Metabolome Database. Nucleic Acids Res. 35, D521-6 [7] Willighagen, E. L. et al. (2007) Userscripts for the Life Sciences, BMC Bioinformatics, 8, 487 (in press). Published online on December, 2007. http://www.biomedcentral.com/1471-2105/8/487 [8] Ertl, P. et al. (2007) Designing drugs on the internet? Free web tools and services supporting medicinal chemistry, Curr. Top. Med. Chem. 7, 1491-501 [9] Villoutreix B.O. et al.(2007) Free resources to assist structure-based virtual ligand screening experiments, Curr. Protein Pept. Sci. 8, 381-411 [10] Scior, T. et al. (2007); Large compound databases for structure-activity relationships studies in drug discovery. Mini-Reviews in Med .l Chem. 7, 851-860.
  • 21. Page 21 of 22 Figures Figure 1 - The Compound Summary Page for Gefitinib in PubChem. Page 1 only is shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=123631)
  • 22. Page 22 of 22 Figure 2: The DrugBox for Gefitinib from Wikipedia (http://en.wikipedia.org/wiki/Gefitinib) Figure 3. An example patent results page for SureChem (www.surechem.org), a chemically intelligent, free access online patent search. A search on Gefitinib provided an abundance of Patent information to examine.