The document discusses the development of the Human Protein Reference Database (HPRD). It describes how Zope, an open source content management system, was used to build the database. Zope allowed for dynamic data structures and an object-oriented approach, handling changing data definitions. The document also outlines challenges in project management for a geographically distributed team and lessons learned around tools for collaboration.
The technology of the Human Protein Reference Database (draft, 2003)
1. Human Protein
Reference Database
An analysis of the technology
powering the database and website,
and how it was developed.
Kiran Jonnalagadda
2. Facts About HPRD
• HPRD is a database of all disease-causing
proteins in the human body.
• It is the most comprehensive database of
its kind in the world today.
• Unlike most other biological databases,
HPRD is protein-centric, not gene-centric.
3. Factors Leading to Choice of DB
• The biologists hadn’t settled on what
information was to be stored and therefore
the data type definitions changed often.
• Several data types were fairly similar to
others but not the same.
• Future extensions had to be built by tech-savvy
biologists with minimal assistance
from programmers.
4. What We Used
• The Zope application server, comprising:
– The Web publishing object framework.
– ZODB, the object database storage system.
– ZCatalog, the indexing and search system.
– ZEO, the stand-alone database server for
multiple front-end Web servers.
5. Why an RDBMS Was Not Suited
• Data type definition changed frequently. In
an RDBMS, this would have meant
redefining tables every week.
• The code currently has about forty data
classes. Imagine having that many data
tables, plus tables for relationships between
them, all under frequent revision.
6. How Zope Handled These Issues
• Zope is built on Python, which offers
dynamic data structures.
• ZODB uses this ability to make the entire
database look like one large data structure,
transparently swapping unused parts to
disk and recovering them as needed.
• ZCatalog indexes data for searching.
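The idea of a database that looks like one large data structure can be sketched with Python's standard-library shelve module. This is a stand-in chosen so the example runs without installing anything, not ZODB itself; the keys and the record layout are assumptions made for illustration.

```python
import os
import shelve
import tempfile

# shelve is a standard-library stand-in here, not ZODB.
# The file name and the record layout are illustrative assumptions.
path = os.path.join(tempfile.mkdtemp(), "demo_db")

# Write: the store behaves like an ordinary dictionary.
with shelve.open(path) as db:
    db["proteins"] = {"BRCA1": {"localization": "nucleus"}}

# Read back later: the nested structure is recovered from disk on access.
with shelve.open(path) as db:
    restored = db["proteins"]

print(restored["BRCA1"]["localization"])
```

No table definitions appear anywhere: the shape of the stored data is whatever the Python structure happens to be at the time it is written.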
7. At Zope’s Core is Python
• Python is a dynamic language.
• When I say dynamic, I mean everything is dynamic!
• Code, variables, classes, modules, everything can
be modified at run-time.
• Most of Zope is built around this ability. Zope
could not have been implemented in another
language.
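A small sketch of what run-time modification looks like in practice. The Protein class and its fields are hypothetical and are not taken from the HPRD code.

```python
class Protein:
    """A minimal data class; the name and fields are illustrative only."""

    def __init__(self, name):
        self.name = name


p = Protein("BRCA1")

# Add a new field to one existing instance at run-time.
p.localization = "nucleus"


# Add a new method to the class at run-time; existing instances see it at once.
def describe(self):
    return f"{self.name} ({getattr(self, 'localization', 'unknown')})"


Protein.describe = describe

# Give every instance, old or new, a default for a newly introduced field.
Protein.molecular_weight = None

print(p.describe())
print(p.molecular_weight)
```

The instance p was created before describe and molecular_weight existed, yet it picks up both without being rebuilt.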
8. Data Storage in Zope
• In Zope, data is stored in instances of a data class.
• The data class has variables, which are like fields,
and methods, which manipulate data.
• Instances of a data class (objects) are stored in
the ZODB, making the database.
• Objects can contain other objects, forming
hierarchies.
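The points above can be sketched in plain Python, leaving out persistence. DataObject, Protein and the field names are hypothetical; real Zope classes carry additional machinery that is omitted here.

```python
class DataObject:
    """Hypothetical base class: a named object that can contain child objects."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def add(self, child):
        """Store a child object under its name and return it."""
        self.children[child.name] = child
        return child

    def walk(self, prefix=""):
        """Yield the path of this object and of every object below it."""
        path = f"{prefix}/{self.name}"
        yield path
        for child in self.children.values():
            yield from child.walk(path)


class Protein(DataObject):
    def __init__(self, name, sequence):
        super().__init__(name)
        self.sequence = sequence  # a variable, acting as a field

    def length(self):  # a method that works on the data
        return len(self.sequence)


root = DataObject("hprd")
protein = root.add(Protein("P1", "MKTAYIAK"))
protein.add(DataObject("interactions"))

print(list(root.walk()))
print(protein.length())
```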
9. Components of Zope
• ZServer (formerly Medusa)
– Handles incoming requests.
– Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP.
• ZPublisher
– Maps URLs to objects and handles security.
• ZODB (Zope Object DataBase)
– Stores objects on disk in a transactional DB.
• ZEO (Zope Enterprise Objects)
– ZODB server for multiple Zope front-end servers.
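Mapping a URL to an object amounts to walking the object hierarchy one path segment at a time. The following is a simplified sketch of that idea using nested dictionaries; it is not ZPublisher's actual algorithm, which also applies security checks at each step, and the traverse function and site layout are assumptions.

```python
def traverse(root, url_path):
    """Resolve a URL path to an object by walking nested containers."""
    obj = root
    for segment in url_path.strip("/").split("/"):
        if not segment:
            continue  # tolerate an empty path or doubled slashes
        try:
            obj = obj[segment]
        except (KeyError, TypeError):
            raise LookupError(f"no object at {url_path!r}") from None
    return obj


# Hypothetical site layout.
site = {"proteins": {"P1": {"title": "Example protein"}}}

print(traverse(site, "/proteins/P1")["title"])
```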
10. Security in Zope
• Security is fine-grained.
• Security is defined around four concepts:
– Users, Roles, Permissions and Hierarchies.
• A user is assigned one or more roles.
• A role is assigned a set of permissions.
• This set can be reassigned at different
positions in the hierarchy.
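The four concepts can be modelled in a few lines. This is a simplified sketch and not Zope's implementation: the Node class, the role names and the permission names are assumptions, and the lookup rule (the nearest definition of a role wins) is one plausible reading of how permissions are reassigned lower in the hierarchy.

```python
class Node:
    """Hypothetical position in the object hierarchy."""

    def __init__(self, name, parent=None, role_permissions=None):
        self.name = name
        self.parent = parent
        # Local settings: role -> set of permissions, valid from here down.
        self.role_permissions = role_permissions or {}

    def permissions_for(self, role):
        """Return the nearest definition of a role's permissions, walking upward."""
        node = self
        while node is not None:
            if role in node.role_permissions:
                return node.role_permissions[role]
            node = node.parent
        return set()


def allowed(user_roles, permission, node):
    """A user may act if any of their roles grants the permission at this node."""
    return any(permission in node.permissions_for(role) for role in user_roles)


root = Node("root", role_permissions={"Annotator": {"View", "Edit"},
                                      "Anonymous": {"View"}})
# The Annotator role is reassigned a smaller permission set below "archive".
archive = Node("archive", parent=root, role_permissions={"Annotator": {"View"}})
entry = Node("entry", parent=archive)

users = {"alice": ["Annotator"]}

print(allowed(users["alice"], "Edit", root))
print(allowed(users["alice"], "Edit", entry))
```

The same user with the same role can edit at the root but only view inside the archive, because the role's permission set was redefined at that position.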
11. Security Outside Zope
• Zope’s security mechanism is limited to the
Web front.
• It is applied only to objects that directly
interface with the end-user.
• Code written in a module in the filesystem
has no security restrictions. It can do
anything.
12. Limitations in Zope
• The API for creating extensions (called
Products) is complicated and poorly
documented.
• The Property Manager interface is too
primitive. It only handles the very basic data
types such as strings, integers, boolean
fields, selection lists and multi-line text.
13. Our Extensions to Zope
• A framework for separating Zope specifics
from our data types, making it much
simpler to add new data types.
• An extended property management system
that could handle changes in data type
definitions and automatically migrate data.
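The slides do not show how the extended property system works, so the following is only a sketch of the general idea of schema-driven migration. The schemas, the field names and the migrate function are all hypothetical.

```python
# Hypothetical schemas: field name -> (type, default value).
SCHEMA_V1 = {"name": (str, ""), "weight": (int, 0)}
SCHEMA_V2 = {"name": (str, ""), "weight": (float, 0.0), "aliases": (list, [])}


def migrate(record, schema):
    """Return a copy of record that conforms to schema.

    Existing values are converted to the new type, missing fields get the
    default, and fields no longer in the schema are dropped.
    """
    migrated = {}
    for field, (field_type, default) in schema.items():
        if field in record:
            try:
                migrated[field] = field_type(record[field])
            except (TypeError, ValueError):
                migrated[field] = field_type(default)
        else:
            # Calling the type copies mutable defaults such as lists.
            migrated[field] = field_type(default)
    return migrated


old = {"name": "P1", "weight": 42, "obsolete": True}
new = migrate(old, SCHEMA_V2)
print(new)
```

With definitions kept as data like this, changing a data type means editing a schema rather than rewriting stored records by hand.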
15. User Interface Design
• We started by exposing Zope’s hierarchy
as the public user interface.
• But there were some elements such as the
category browser and the
16. Templates for the Web UI
• Choice of DTML and ZPT for templates.
• ZPT for templating system.
17. Part III
Project Management Lessons
What we learnt about managing a
project across continents and distant
time zones.
18. Project Management Issues 1
• We learnt the hard way that a project
manager’s place is with his team, not with
the client.
• Productivity suffers in the absence of an
effective collaboration tool.
• E-mail and instant messengers are not
effective collaboration tools.
19. Project Management Issues 2
• Collaboration over e-mail imposes the
burden of articulation on the
communicator, which many dislike and
therefore avoid.
• Instant messaging prevents collecting
thoughts before presenting them and is
therefore a poor planning tool.
20. Collaboration Tools
• We experimented with several
collaboration systems, with varying
effectiveness:
– Phone calls.
– Instant messengers.
– Wikis.
– Issue tracking software.
– Mailing lists.
21. Phone Calls
• Next best thing to face-to-face discussions.
• But they connect only two people unless non-standard equipment is used.
• International calls are usually too expensive
for the resulting gain.
22. Instant Messengers
• Provide critical communication between
geographically distributed team members.
• But the pressure of maintaining continuity
in a conversation hinders pausing to gather
thoughts.
• Typing is much slower than talking.
Therefore little else gets done alongside.
23. Wikis
• The easy hyperlinking system of a wiki
combined with structured text makes
presenting information a snap.
• With a little code thrown in, Wikis could
make a wonderful project management
tool.
• A changed page notification system is
needed or changes go unnoticed.
24. Issue Tracking Software
• We use BugZilla to track issues.
• But in the eight months we have used it, only 30
issues have been reported through it.
• The other few hundred were reported over e-mail, instant messengers and in person.
• Clearly, the problem is with BugZilla’s usability.
The search for a new system is on.
25. Mailing Lists
• E-mail is push media: the latest is always on
top of your inbox.
• E-mail makes an effective to-do list in an
interface the user is comfortable with.
• Mailing lists are e-mail in broadcast mode.
• Mailing lists have been the most effective
collaboration tool we’ve used so far.
26. Issues With Programmers
• Programmer skill levels and attitudes vary.
• C programmers tend to write C code in
Python.
• PHP programmers tend to write PHP code
in Python.
• Learning Python is easy but thinking in
Python takes a long time.
27. Programming Tools We Used
• CVS for source control.
• ViewCVS for a Web front-end to CVS.
• Vim in GUI mode for source editing
(preferred editor of everyone in the team).
• The print statement for debugging.
28. Tools We Should Have Used
• WingIDE is a $35 piece of software that
provides an interactive Python debugger
usable with Zope.
• A few minutes of use would have more than
paid for it, given the hours of programmer
time we instead spent debugging with the
print statement.
29. Part IV
Things Needing Fixing
Mistakes we made during
development, how they affect things
now, and how they can be fixed.
30. Naming Conventions
• We started out assuming HPRD was gene-centric
and named several things
GeneSomething.
• In code, this can be considered just an
identifier.
• But in a URL, such a name can confuse
users and needs renaming.
31. Reusable Modules
• All of the code currently sits in one
directory.
• Several important pieces are generic and do
not depend on how HPRD uses them.
• These modules could be separated and
contributed independently to the open
source code pool.
32. Data in Code
• Bits of implementation-specific data are
embedded in the code in some places,
particularly in graph generation.
• These were introduced as quick patches
for a temporary problem but have
remained in place for months now.
• These need to be taken out so that the
code is truly reusable.
33. Documentation
• DocStrings needed in code.
• Consistent language in DocStrings.
• HTML documentation files to be
distributed with code.
Editor's Notes
Insert points here outlining the data requirements of HPRD.
Needs more slides before this explaining the organization of a project in Zope.
Backup for statements on C and PHP programmers:
In a C function, all variables have to be declared first with an explicit data type before they can be used. Variables cannot be declared just before use. C programmers tend to reuse temporary variables in a long function.
A C programmer new to Python will therefore tend to write C code translated into Python. Examples of this coding style are initializing temporary variables to blank values (“” for strings and 0 for integers) and reusing the same variables instead of deleting them and using new ones, or better, writing nested functions.
This style causes bugs such as the following: one part of a long function expects a temporary variable to still hold its blank initial value, but a part higher up in the function was later extended to use the same variable, and the programmer forgot to reset it afterwards. Such bugs can wreak havoc in code that was functioning perfectly before.
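A minimal sketch of this kind of bug and of the nested-function alternative. The function names and the data are invented for illustration and do not come from the HPRD code.

```python
def summarize_c_style(names, scores):
    """C style: one pre-declared temporary reused for two unrelated jobs."""
    tmp = ""
    for name in names:
        tmp = tmp + name + ","
    header = tmp
    # Bug: tmp should have been reset to "" here before being reused.
    for score in scores:
        tmp = tmp + str(score) + ","
    return header, tmp


def summarize_pythonic(names, scores):
    """Each job gets its own scope through a nested function."""

    def joined(items):
        return "".join(f"{item}," for item in items)

    return joined(names), joined(scores)


print(summarize_c_style(["a", "b"], [1, 2]))
print(summarize_pythonic(["a", "b"], [1, 2]))
```

In the first version the second result silently carries the names along with the scores. In the second version there is no shared temporary left to forget about.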
The problem with PHP programmers is not as severe. Because PHP’s support for object orientation is weak, PHP programmers tend to write a collection of functions where they should have defined a new class. The same code management problems follow.