Thoughts on Hall D Database Use
-------------------------------
Elliott Wolin
16-Nov-1999
During my talk on Online Strategies at the recent Hall D DAQ/Trigger
Workfest I briefly discussed the use of databases in Hall D. Below I
elaborate on this important topic.
Why Use Databases
-----------------
Databases help to solve two general problems. From the top-down
perspective, they help you manage complexity; e.g. tracking the
analysis state of the large number of files produced, indexing the
numerous sets of calibration constants and snapshots of the detector
configuration, etc.
From the bottom-up perspective, databases separate the action of
specifying what data you want from the mechanics of actually
retrieving the data from the datastore.
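As a minimal sketch of this separation, consider looking up a
calibration constant by run number. Everything named here is an
assumption made for illustration: the calib table, its columns, the
detector names, and the values are invented, and Python's built-in
sqlite3 module merely stands in for whatever database is eventually
chosen. The caller states which constant it wants; the database
engine decides how to find it.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical calibration table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calib ("
        " detector TEXT, run_min INTEGER, run_max INTEGER, gain REAL)"
    )
    # Invented example rows: one gain per detector per run range.
    conn.executemany(
        "INSERT INTO calib VALUES (?, ?, ?, ?)",
        [
            ("tof", 1, 99, 1.02),
            ("tof", 100, 199, 1.05),
            ("cal", 1, 199, 0.97),
        ],
    )
    return conn


def gain_for(conn, detector, run):
    """Say WHAT is wanted; the engine works out HOW to retrieve it."""
    row = conn.execute(
        "SELECT gain FROM calib"
        " WHERE detector = ? AND ? BETWEEN run_min AND run_max",
        (detector, run),
    ).fetchone()
    return None if row is None else row[0]


conn = open_demo_db()
tof_gain = gain_for(conn, "tof", 150)
```

Nothing in gain_for depends on how or where the rows are stored, so
the storage layout could change without touching the calling code.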
Database Choice
---------------
There are many types of databases we should consider, including
relational (RDBMS), object-oriented (OODBMS), file-system based, and
custom databases. I estimate we have 2 years before we need to make
final decisions. Which one(s) we choose depends on:
1) what we intend to do with them
2) how mature the technology is
3) how much manpower is required to develop and maintain them
Concerning point 1, a major question is whether to store the online
event data in a database, as is done by BaBar, LHC experiments, etc.
I note that in their case a small number of interesting events are
embedded in a large number of uninteresting events. In our case, all
hadronic events are interesting.
Less controversial is the use of databases for online configuration
and offline calibration data, as this is routinely done.
Concerning point 2, object-oriented database systems are developing
rapidly, and the state of the art might be well advanced (due to the
efforts of BaBar, RD45, etc.) in two years. Relational database
systems are mature already, and are in widespread use.
Concerning point 3, our guiding principle should be that our choices
reflect the scale of the experiment. The Hall D collaboration is much
smaller than the large HENP collaborations (BaBar, STAR, CDF, D0, LHC
experiments, etc.); we can only afford to push the state of the art in
a few carefully chosen areas.
My Prejudices
-------------
I list my prejudices to create a starting point for further
discussions:
1. We should use databases
Aside from the raw data (discussed below), the days of storing
everything in flat files or custom databases are over; we should use
the standard technology, relational or object-oriented, as
appropriate. The database might point to flat files, but our thinking
should be (in Chip Watson's words) "database-centric" from the start.
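To illustrate what a database that points to flat files might look
like, here is a rough sketch of a file catalog that records each
file's analysis state. The data_files table, the state names, and the
file paths are all hypothetical, and sqlite3 is used only because it
ships with Python; it is not a proposal for the actual system.

```python
import sqlite3

# Hypothetical catalog: the database holds paths and bookkeeping,
# while the event data itself stays in flat files.
catalog = sqlite3.connect(":memory:")
catalog.execute(
    "CREATE TABLE data_files ("
    " path TEXT PRIMARY KEY, run INTEGER, state TEXT)"
)
catalog.executemany(
    "INSERT INTO data_files VALUES (?, ?, ?)",
    [
        ("/data/run0101.evt", 101, "raw"),
        ("/data/run0102.evt", 102, "reconstructed"),
        ("/data/run0103.evt", 103, "raw"),
    ],
)


def files_in_state(db, state):
    """Return the flat-file paths whose analysis state matches `state`."""
    rows = db.execute(
        "SELECT path FROM data_files WHERE state = ? ORDER BY run", (state,)
    )
    return [path for (path,) in rows]


def mark_state(db, path, state):
    """Record a new analysis state for one file; the file is untouched."""
    db.execute("UPDATE data_files SET state = ? WHERE path = ?", (state, path))


pending = files_in_state(catalog, "raw")
mark_state(catalog, "/data/run0101.evt", "reconstructed")
```

The bulk data never enters the database; only the index and the
bookkeeping do, which is the sense of "database-centric" intended
here.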
2. Don't store raw data in OO databases
I don't think we should store our online data in an OO database. I
predict that in 2 years the state of the art will not have advanced
far enough to bring the required manpower and money within our means
(OO databases will probably still be fairly expensive).
I also don't think OO databases are particularly suited for our raw
data. We use a loose trigger; i.e. our events are "min-bias" events.
After reconstruction an OO database might be appropriate, as our
events can then be classified by final state. Then the notion of
iterating over a special subset of events becomes useful. Will our
online farms perform final reconstruction...?
We might consider storing the data in ROOT (a kind of OO database), in
CODA format, or in a custom format.
3. Non-event data
Relational databases are mature, widely available, cheap (MySQL is
free), well understood, simple to use, and easy to interface with
(C++, C, Perl, Java, etc.). They are well suited to holding online
configuration data, calibration constants, logbooks, etc., and are in
widespread use.
OO databases support more complicated data structures, and could work
well for non-event data. Relational database structures (tables) are
easily mapped into objects in an OO database. Access to OODBMSs is
more difficult, and fewer access methods are available (perhaps just
C++, although this might change).
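The correspondence between table rows and objects can be sketched in
a few lines. This is not an OO database, only plain Python showing
the mapping: each column becomes an attribute and each row becomes an
object. The hv_settings table and the HighVoltageSetting class are
hypothetical names chosen for the example, and sqlite3 again stands
in for a real relational database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class HighVoltageSetting:
    """Hypothetical object form of one row of the hv_settings table."""
    channel: int
    volts: float


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hv_settings (channel INTEGER, volts REAL)")
db.executemany(
    "INSERT INTO hv_settings VALUES (?, ?)", [(0, 1500.0), (1, 1525.5)]
)

# Each column becomes an attribute, each row becomes an object.
settings = [
    HighVoltageSetting(channel, volts)
    for channel, volts in db.execute(
        "SELECT channel, volts FROM hv_settings ORDER BY channel"
    )
]
```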
I think relational databases should be our nominal choice, but we
should keep abreast of developments in the OO database world.