This document discusses key aspects of building databases to catalog global biodiversity in the 2000s, including standards, technology, data sharing challenges, and classification methods. It covers how database infrastructure requires stable standards and technology to ensure data accessibility over time. Issues around data ownership, privacy, and ensuring data can be shared and reused across disciplines are also addressed. Classification systems are evolving from paper-based to digital formats using tools like cladistics and computer programs to help organize the vast amounts of data being collected through worldwide biodiversity projects.
2. Four Key Aspects. Database Infrastructure: standards (flexible, stable); technology (stable); communication. Data Sharing: ownership; disarticulation; data collection.
3. Four Key Aspects. Distributed Collective Practice: collaborative work; new knowledge economy. Accounting for Life: development of classification; cladistics; the future.
5. Standards. Why do we need standards? Example from the air-conditioner industry: the diameter of a screw must match the hole on the panel. Reasons for databases: various media need a "handshake", such as the MIME (Multipurpose Internet Mail Extensions) protocol. Each layer of infrastructure requires its own set of standards, and standardized categories are needed.
9. Standards. Standards will not always win. Why? The best standard may not win the market. Standards setting is a key site of political work: an inferior standard may be favored by a political agency (such as a standards-setting body).
11. Standards: Interoperability. Some related standards: 1. ANSI/NISO Z39.50. ANSI/NISO Z39.50 is the American National Standard Information Retrieval Application Service Definition and Protocol Specification for Open Systems Interconnection. It makes it easier to use large information databases by standardizing the procedures and features for searching and retrieving information.
13. Standards: Interoperability. Some related standards: 1. ANSI/NISO Z39.50. It allows a single enquiry over multiple databases and is widely adopted in the library world.
14. Standards: Interoperability. Some related standards: 2. XML. Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. Two extremes: a. the colonial model; b. the democratic model (which wins out), built around people's established computing environment.
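As a rough illustration of what "machine-readable" means here, the sketch below parses a small XML record with Python's standard library. The element names (specimen, genus, species, collector) and the values are hypothetical, chosen only for illustration; they are not taken from any biodiversity standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical specimen record; element names are illustrative assumptions.
RECORD = """
<specimen id="demo-001">
  <genus>Quercus</genus>
  <species>robur</species>
  <collector>Unknown</collector>
</specimen>
"""

def parse_specimen(xml_text):
    """Return the record's id and child elements as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    fields = {child.tag: (child.text or "").strip() for child in root}
    fields["id"] = root.get("id")
    return fields

record = parse_specimen(RECORD)
print(record["genus"], record["species"])
```

Because the tags travel with the data, any program that understands XML can read the record without knowing which software produced it.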
15. Technology. Technology must be stable. Nothing guarantees the stability of vast data sets. Example: the failure of Paul Otlet's well-catalogued microfiches. Development of computer memory: information becomes hard to retrieve.
16. Technology. Technology must be stable so that data stays accessible and usable. Infrastructure will require a continued maintenance effort. Reasons: a. data is passed from one medium to another; b. data is analyzed by one generation of database technology after another.
17. Issues of Communication. Problem of reliable metadata. Metadata: data about data. (Figure: the blue lines are metadata.)
18. Issues of Communication. Problem of reliable metadata. Metadata gives a standard name to certain kinds of data, which makes the data easy to search across multiple databases. Issue: how detailed should the name of the data be? Too little detail: the information about the data is useless. Too much detail: longer time, more work.
19. Issues of Communication: Dublin Core. The Dublin Core set of metadata elements provides a small and fundamental group of text elements through which most resources can be described and cataloged. The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
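A minimal sketch of how the 15 element names could be used to check a metadata record. The record, its values, and the helper name unknown_elements are hypothetical and serve only as an illustration; the sketch does not reproduce any official Dublin Core tooling.

```python
# The 15 elements of the Simple Dublin Core Metadata Element Set.
DCMES = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}

def unknown_elements(record):
    """Return, sorted, the keys of `record` that are not DCMES elements."""
    return sorted(set(record) - DCMES)

# Hypothetical record; the values are placeholders, not real catalog data.
record = {
    "Title": "Example specimen list",
    "Creator": "Example Author",
    "Date": "2000-01-01",
    "Habitat": "wetland",  # not a Dublin Core element
}

print(unknown_elements(record))
```

A small, fixed vocabulary like this keeps records searchable across databases, at the cost of having no standard place for domain-specific details such as the "Habitat" field above.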
21. Ownership: Control of knowledge. Mid-nineteenth century: knowledge came only from professionally trained scientists and doctors. New information economy: knowledge comes from many people. Example: patient groups.
22. Ownership: Privacy. Keeping data private is difficult. Example: data is compiled by a third-party company to generate a new, marketable form of knowledge. New patterns of ownership: science has frequently been analyzed as a "public good", but with the increasing privatization of knowledge it is unclear to what extent the vaunted openness of the scientific community will last.
25. Data Collection: Dealing with old data. Difficulties: scientific papers do not in general offer enough information to allow an experiment or procedure to be repeated. The distributed database is becoming a new model form of scientific publication in its own right. Issues of updating: there is no automatic update from one field to a cognate one, and scientists are not able to share information across disciplinary divides.
26. Data Collection: International technoscience. Purpose: narrow the gaps between countries. Issues: people do not have equal knowledge; access is never really equal; governments doubt the usefulness of opening databases to the Internet.
28. Collaborative Work. Management structures in universities and industry still tend to support the heroic myth of the individual researcher. What kind of value do the large publishing houses add to journal production? Great attention must be paid to the social and organizational setting of technoscientific work.
29. New Knowledge Economy. Three central issues: 1. the development of flexible, stable data standards; 2. the generation of protocols for data sharing; 3. the restructuring of scientific careers.
32. Development of Classification: Importance of classification. 18th-19th centuries: a botanist had to know all genera and commit their names to memory, but could not be expected to remember all specific names (A.J. Cain, 1958). Later part of the 19th century: new information technologies were developed that permitted the easy storage and coding of larger amounts of data than could previously be easily manipulated (Chandler, 1977; Yates, 1989).
33. Development of Classification: Example of classification. Paper-based archival practice. Issue: material was hard to reclassify. Type specimens had to be relocated physically, and so did series of articles or books.
34. Development of Classification: Example of classification. Multifaceted classification system. Improvement: classifications can be ordered in multiple ways rather than in a single way. Example: a collection of books might be classified using an author facet, a subject facet, and a date facet.
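The book example can be sketched as follows. The catalog entries and the helper name group_by_facet are invented for illustration; the point is only that one set of records can be reordered along any facet without being physically rearranged.

```python
from collections import defaultdict

# Hypothetical catalog entries, invented for illustration.
BOOKS = [
    {"title": "Book A", "author": "Smith", "subject": "Botany", "date": 1958},
    {"title": "Book B", "author": "Jones", "subject": "Botany", "date": 1977},
    {"title": "Book C", "author": "Smith", "subject": "Zoology", "date": 1989},
]

def group_by_facet(items, facet):
    """Group item titles under each value of the chosen facet."""
    groups = defaultdict(list)
    for item in items:
        groups[item[facet]].append(item["title"])
    return dict(groups)

by_author = group_by_facet(BOOKS, "author")
by_subject = group_by_facet(BOOKS, "subject")
print(by_author)
print(by_subject)
```

The same three records appear under two different orderings, which is what a single paper-based shelf arrangement cannot offer.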
35. Development of Classification: Example of classification. Hierarchical classification (for reading the past). In the early 1970s, E.F. Codd separated the physical storage of data in the computer from the representation of that data. Disadvantage of a fixed hierarchy: it becomes awkward to introduce other levels of taxonomic category as an afterthought. Improved method: one record for every name, regardless of its taxonomic level.
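One way to picture the "one record for every name" approach is sketched below, under the assumption that each record stores its rank as free text and points to its parent by name. The class NameRecord, the function lineage, and the taxon names are illustrative placeholders, not a schema from any actual biodiversity database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NameRecord:
    name: str
    rank: str               # free-text rank, so new ranks need no schema change
    parent: Optional[str]   # name of the parent record, or None at the top

def lineage(records, name):
    """Walk parent links from `name` up to the root; return the names in order."""
    path = []
    current = name
    while current is not None:
        path.append(current)
        current = records[current].parent
    return path

records = {
    r.name: r
    for r in [
        NameRecord("Fagaceae", "family", None),
        NameRecord("Quercus", "genus", "Fagaceae"),
        NameRecord("Quercus robur", "species", "Quercus"),
    ]
}

# Introducing an intermediate rank later: add one record, repoint one parent.
records["Quercoideae"] = NameRecord("Quercoideae", "subfamily", "Fagaceae")
records["Quercus"].parent = "Quercoideae"

print(lineage(records, "Quercus robur"))
```

Adding the intermediate level touches two records and no table structure, which is the flexibility a fixed column-per-rank layout lacks.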
36. Cladistics: Definition. Cladistics is a method of classifying species of organisms into groups called clades, which consist of 1) an ancestral organism and 2) all of its descendants. Features: it gives a more regular algorithm for determining phylogeny; it focuses attention on the shared, derived characteristics of a set of organisms; it uses "outgroup" comparisons to develop the classification system.
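The definition of a clade (an ancestor plus all of its descendants) can be sketched as a simple tree traversal. The tree and its labels are hypothetical; the sketch shows only the definition, not how a cladistic analysis infers the tree in the first place.

```python
# Hypothetical tree: each key is an ancestor, each value lists its direct descendants.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def clade(children, ancestor):
    """Return the ancestor plus all of its descendants."""
    members = {ancestor}
    stack = [ancestor]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in members:
                members.add(child)
                stack.append(child)
    return members

print(sorted(clade(CHILDREN, "A")))
```

A group such as {"A1", "B1"} would not qualify as a clade in this tree, because it leaves out the common ancestor and some of that ancestor's descendants.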
37. Cladistics: Tree of life. Cladists use cladograms, diagrams that show ancestral relations between taxa, to represent the evolutionary tree of life. Charles Darwin (1809–1882) is often credited as the first to produce an evolutionary tree of life.
39. Cladistics: Computer programs in cladistics. David Swofford's PAUP is a software package for the inference of evolutionary trees. Example: an analysis undertaken using Swofford's (1985) package PAUP, version 2.4 installed on a Cyber mainframe computer and version 2.4.1 on an Amstrad 1512 PC. Purpose: follow a given algorithm for generating and testing cladograms.
41. Cladistics: Computer programs in cladistics. Issues: the packages produce variable results and cannot possibly look at all the possibilities, because the underlying search problem is NP-complete. Algorithm issues.
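To give a feel for why a package cannot look at all the possibilities, the sketch below counts the distinct rooted, fully bifurcating trees on n labeled taxa, which is the double factorial (2n-3)!!. This is a standard counting result used here as an illustration; it is not a description of how PAUP itself searches.

```python
def rooted_tree_count(n_taxa):
    """Number of distinct rooted, fully bifurcating trees on n labeled taxa: (2n-3)!!"""
    if n_taxa < 2:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3.
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count

for n in (3, 5, 10, 20):
    print(n, rooted_tree_count(n))
```

Ten taxa already allow more than 34 million rooted trees, so exhaustive comparison quickly becomes impractical and programs fall back on heuristic searches, which is one reason different runs can give different results.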
42. The Future: Storing life. Life is itself described as a program, with DNA as the code. If everything is information, then life can equally well be "stored".