Cacl, a doctoral student at
the University of Michigan's School of Information, wrote a Slashdot
feature addressing the concerns and problems of preserving digital information.
This is an area of study of his, and it is interesting to read about, so I thought
the article worth saving:
Recently there was an Ask
Slashdot about the problem of preserving digital material. The basic
idea was that we are creating a massive wealth of digital information, but have
no clear plan for preserving it. What happens to all of those poems I write
when I try to access them for my grandkids? What about the pictures of my kids
I took with that digital camera? Can I still get to them in time to embarrass
them in the future?
Obsolescence of digital media can happen in three different ways:
- Media Decay: Even when magnetic media are kept in dry conditions,
away from sunlight and pollution, and hardly ever accessed, they will still
decay. Electrons will wander over the substrate of the media, causing digital
information to become lost. CD-ROMs luckily do not have this same problem
with electron loss. They still are sensitive to sunlight and pollution though.
Many people mentioned last week that distributors of blank CD media often
make claims of a hundred years or more for the lifetime of their products.
Research seems to indicate the truth is closer to 25 years, which seems like
a long time, until you consider the factors below. Besides, information professionals
often think in terms of centuries rather than decades.
- Hardware obsolescence: Far more dangerous than the degradation of
the actual information container is the loss of machines that can read it.
For instance, the Inter-University Consortium of Political and Social Research
received a bunch of data on old punch cards. The problem was they had no punch
card reader. It took a decent chunk of time, and a good deal of money to eventually
be able to read the data off of these cards, even requiring some old technicians
to come out of retirement to help tweak the system. Hardware extinction is
hardly a foreign topic to Slashdotters. It happens, and as technology increases
its pace of change, it will happen more quickly.
- Software obsolescence: The real stone in the shoe of digital preservation
is obsolescence of the software needed to open the digital document. This
can include drivers, operating systems, or plain old application software. We all have piles
of old software that was written for older systems, or come across an old
file at the bottom of a drawer where we can't even remember what application
it used.
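One common way to notice media decay before it becomes data loss is a fixity check: record a checksum for each file while it is known to be good, then recompute and compare later. The following is only a minimal sketch of that idea, not something described in the article. The file names and contents are made up, and the archive is held in memory so the example stays self-contained.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged(manifest: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Return names that are missing or no longer match their recorded digest."""
    damaged = []
    for name, recorded_digest in manifest.items():
        data = current.get(name)
        if data is None or sha256_of(data) != recorded_digest:
            damaged.append(name)
    return sorted(damaged)


# Hypothetical archive contents, recorded while the media was still healthy.
original = {
    "poem.txt": b"Roses are red",
    "kids.jpg": b"\xff\xd8\xff\xe0 pretend image bytes",
}
manifest = {name: sha256_of(data) for name, data in original.items()}

# Simulate decay years later: a single bit flips in one file.
decayed = dict(original)
corrupted = bytearray(decayed["kids.jpg"])
corrupted[5] ^= 0x01
decayed["kids.jpg"] = bytes(corrupted)

print(find_damaged(manifest, decayed))  # ['kids.jpg']
```

A check like this only detects damage. Repairing it still requires a second good copy, and it does nothing about hardware or software obsolescence.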
There are several strategies for preserving digital information. People mentioned
some last week:
- Transmogrification: printing the digital document into an analog
form and preserving the analog copy. An example would be printing out a Web
page and archiving the print of that Web page. This, obviously, takes out
the main strength of a Web document, hyperlinking, and may also ignore important
colour and graphical content. An alternative form of this is the creation
of hardcopy binary that could later be data entered into the computers of
the future. The media suggested have ranged from acid free paper to stainless
steel disks etched with the binary code. The two major problems with this
idea are that any misrepresentation of the binary could have disastrous results
for the renewal of the document, and transformation to hard copy limits the
functionality of many types of digital documents to the point of uselessness.
- Hardware museums: preserving the necessary technology needed to run
the outdated software. There are several weaknesses to this plan. Even hardware
that is carefully maintained breaks and becomes un-usable. In addition, there
is no clear established agency that will be responsible for maintaining these
machines. Spare parts eventually become impossible to find and legacy skills
are required for maintenance. There must be technicians with the requisite
skills to service these preserved machines. Finally, it is inefficient if
all possible future users are bottlenecked at just a handful of viewing
sites to gain access to the information.
- Standards: reliance on industry-wide standardization of formats to
prevent obsolescence. Marketplace pressures on software producers create
an incentive for a company to differentiate its product from its competitors'.
While industry-wide standardization is unrealistic in a capitalistic marketplace, standards such as SGML have
proven successful for large scale digital document repositories, like the
Making of America archive hosted by the
University of Michigan. However, many of these large repositories also receive
information from donors that is not in a standardized format, and do not feel
comfortable turning away those documents.
- Refreshing: moving a digital object from one medium to another. For
instance, transferring information on a floppy disk to a CD-ROM. This definitely
seemed to be the preferred method of most Slashdotters. While this takes care
of degradation and obsolescence of the media, it does not solve the problem
of software obsolescence. A perfectly readable copy of a digital document
is useless if there is no software program available to translate it into
human-readable form.
- Migration: moving the digital document into newer formats. An example
might be taking a Word 95 document and saving it as a Word 97 document. Single
generation leaps are usually not a problem, so large volumes of information
could be saved. Unfortunately, migrations over several generations are often
impossible, as is migrating from a document type that was abandoned, and did
not evolve. Also, information loss is common in migration, and may cause the
document to become unreadable. While this may be the best single method available,
it is very labour intensive, and some knowledge of the nature of documents
would be essential to determining which information containers to migrate.
For instance, often you lose aspects of a document (good and bad) when you
migrate it, but which of those aspects are important?
- Emulation: creating a program that will fake the original behavior
of the environment in which the digital object resided. This is another very
intriguing method that could be used. It's actually already pretty common.
For instance, most processor chips include emulators for lower level processors.
There also already exists on the Internet a very active group of people who
are interested in emulating old computer platforms. Still, we need to do a
lot of research yet on the cost of this method, and what sorts of metadata
are necessary to bundle with the digital object to facilitate its eventual
emulation. Another problem is the intellectual property hassle caused by emulation.
Reverse engineering is a big no-no, and there is no point in making the lawyers
rich. This area is actually where Open Source can be of the greatest help in preserving
the longevity of different kinds of applications.
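The migration point above, that single-generation leaps are easy while multi-generation leaps and abandoned formats are not, can be pictured as a search through a graph of available converters. The sketch below is purely illustrative. The converter table is hypothetical and does not describe what any real product can actually open or save, and the format names beyond Word 95 and Word 97 are assumptions added for the example.

```python
from collections import deque

# Hypothetical table: each format maps to the formats some tool can save it as.
CONVERTERS = {
    "word95": ["word97"],
    "word97": ["word2000", "rtf"],
    "word2000": ["rtf"],
    "wordstar": [],  # an abandoned format: nothing converts from it
}


def migration_path(source, target, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of conversions.

    Returns the list of formats from source to target, or None if no
    chain of converters connects them.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_format in converters.get(path[-1], []):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None


print(migration_path("word95", "word2000"))    # ['word95', 'word97', 'word2000']
print(migration_path("wordstar", "word2000"))  # None
```

Every hop in a returned path is a chance for the information loss mentioned above, so a shorter chain is generally safer, and a result of None corresponds to the abandoned-format case where migration is not possible at all.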
Many people in the discussion last week seemed to believe that simple refreshment
or migration of the data would be a sufficient answer to the problem. At a personal
level that may be true, but for anyone responsible for large amounts of digital
information, neither is a completely convincing method. Here are a couple of
reasons why:
- Not all documents are the same: In the digital preservation literature,
most people talk as if all digital information is in ASCII format. Au contraire.
As computing becomes increasingly robust, so do the documents we create. Multimedia
games, three dimensional engineering models, recorded speeches, linked spreadsheets,
virtual museum exhibits and a host of other documents spurred by the development
of the Web have cropped up. How are they going to be affected by migration
to a new environment?
- It's so darned expensive: It's a little gauche to talk about, but the Y2K
bug caused what ended up being a huge migration of digital information. How
much did the US alone spend on that fiasco? $8 billion? For smaller organizations
that do not prepare for the preservation of their digital information, the
cost of emergency migrations could cause all sorts of budget trouble.
There is some belief that there is no reason to preserve information at all.
Most of what is created is just tripe anyway, and we should be more focused
on creating content than preserving it. There are two reasons why some sort
of preservation is important. First of all, it is inefficient to recreate information
that already exists. Human energy is better spent on building upon existing
knowledge to create new wisdom. How much do we already spin our wheels as several
people collect the same data? What more could we be doing if we spent the energy
instead on new pursuits? Secondly, there is some data that is irreplaceable.
Which is not to say that we should keep everything. In a traditional archive,
only 1% of documents received are kept. Ninety-nine out of one hundred documents
are destroyed for various reasons. A similar ratio is not unreasonable for digital
documents. Consider that 16 billion email messages are sent each day. It seems
ridiculous to keep all of them, but how do we weed out the ones we do want to
keep? Appraisal of digital documents for archival purposes is going to become
a major issue in the not-too-distant future. There are already examples of data
that have been lost, or nearly lost. NASA lost a ton of data off of decayed
tapes. The U.S. Census nearly lost the majority of the data from the 1960 census.
These huge datasets are important for establishing a scientific record that
reveals longitudinal effects.
Increasingly, the record of the human experience is kept in a digital format.
The act of preserving that information is the act of creating the future's past,
the literal reshaping of our world in the eyes of the future. Nobody knows the
best answer yet. There is probably not a single answer that will fit absolutely
all situations. Information professionals are just beginning to do research
in the form of user testing, cost-benefit analysis and modeling to answer some
of the thornier issues raised by the preservation of digital information. There
are things out there worth saving; we just need to figure out the best way to
do it.
Some links of interest in case you would like to read more:
- a really good bibliography of
related sources by Michael Day
- an article by
Jeffrey Rothenberg outlining some of the issues
- a site at Leeds University
with many related links