postings regarding digital persistence

Stewart Brand asked me to write some thoughts to a mail group for people interested in the Clock Library. A very short contraction of these comments was published in Wired. Here's the posting from which this contraction was derived:

We used to think of digital documents as eternal, because you could clone them. As long as even one clone survived somewhere, the information would not be lost. This was one of the main reasons to be excited about things digital, because entropy is gradually erasing everything analog. Every copy of every film, LP, and photograph is gradually disappearing. If digital hadn't come along, huge pieces of our history would eventually vanish. But it's not so simple in practice.

I was asked last year by a museum to display an art video game (Moondust) that I had written in 1982. It ran on a computer known as the Commodore 64 that had already sold in the millions by the time of the game's release. It turns out that after my game cartridge was introduced, there was a slight hardware change to the computer (in 1983), which caused the sound to not work. So I had to find a 1982 Commodore 64. There were also compatibility problems with the video interface box and joystick. It took months to find a working set of parts. All this trouble with a machine whose operating system was fixed in ROM and had been available at the time in the millions!

We might imagine that we have become less vulnerable to this process of digital environment loss, but we shouldn't be so confident. Even if there are millions of examples of a content delivery machine, it will eventually become difficult, and then ultimately impossible to reconstruct the moment when an interactive work matched up with its hardware and software environment.

How long can a digital document remain intelligible in an archive? There are two kinds of digital artifacts that can be roughly distinguished in this regard, though the division is blurry. The first kind is the more traditional non-interactive document, which consists of media that will be displayed without deep modification during normal use. The order in which the material is perceived might change, fonts might change, pagination might change, color space might be distorted; but there is an immobile core of content. Texts, images, moving images, sounds, and potentially textures, tastes and other records of sensory experience fall into this category. These documents can typically be transferred into alternate formats, but often with a loss of design integrity.

The second type of document is a more recent arrival. This is a deeply interactive document, such as a CD-ROM, video game, highly Java-ised web page, interactive or generative music work, or, for the worst possible case, a virtual world. In this case the interactivity is core to the content. The specification of the interactivity is only intelligible in a reconstruction of the hardware and software environment in which it was written. (If the time domain is not important to a particular interactive document, then the hardware environment becomes less vital and more abstract; only then does the strategy of pure software emulation become sufficient, but the time domain is likely to become more and more important. An example of a design where the time domain is not critical is a command line interface.)

It is impossible to keep a hardware/software environment for a type 2 document constant for extended periods of time. With each incremental change must come corresponding incremental changes to interactive works that run in the environment. A dormant document might be revived in a world in which it is no longer intelligible. There will probably be no end to the exploration of new hardware/software environments in which interactive designs can be expressed. This is a new art form in its own right, that is generating its own loving momentum- it is unstoppable. So it would not be wise to plan on the appearance of a stable, ultimate standard metaformat for interactivity.

My sense is that deeply interactive works (type 2) can usually still be resuscitated (though it isn't effortless) after about 3 environment generations. Less interactive type 1 docs that are in popular formats can potentially survive much longer- maybe dozens or even hundreds of generations. The danger with type 1 docs is the illusion that longevity equals immortality. Because of this illusion, type 1 docs might demand even more vigilance in the long term.

I'm doing one test with type 1 docs now. I've declined to update my copy of Microsoft Word since version 5.1. The main reason to update is to maintain compatibility with everyone else who has- but that seems like such a scam for Microsoft. I suppose the fashion industry pulls the same scam on their customers all the time, but in that case it's at least fun for those who care. I ask everyone to devolve files they send me to v5 and I send out v5 files to others. How long will this be possible? My guess is about 10 years. And then a lot of texts will start to vanish- though they won't vanish all at once; v5 Word files can still be figured out to some degree with an ASCII viewer. In as soon as ten years, if nothing is done to encourage better practice, we might find that some archival texts have lost their italics and artful pagination.

ASCII itself would seem to be a likely candidate for a long-lived format- although in practice it has to be housed in less long-lived delivery formats- such as hard disks readable by varied OSs. As hard as it is to imagine now, even ASCII will inevitably start to drift, though most probably not in our lifetimes.

We can already observe type 2 documents decaying at a faster rate. I've designed many virtual worlds in the software made by my old company VPL, as did a lot of other people. Since that software hasn't been supported for quite a while, many of us have tried to transfer our v-worlds to other software packages. The v-worlds in question are significantly complex, such as a brain surgery planner. The transfer process turns out to be impossible. The VPL software had a particular way of organizing interactivity that has thus far proven to be too hard to emulate on other systems. The irony and difficulty is that it is exactly those kinds of ahead-of-their-time, influential interactive systems which are the most important to preserve- and the hardest.

But the problem is just as real for mainstream data. The management of old software is in general a huge multi-billion dollar problem. We used to call it a "crisis" in the biz, but stopped using that alarmist term because of its implied optimism; a crisis has a resolution, but the software maintenance problem appears to be with us for the long term.

There are some, such as Eric Drexler, who believe that fast computers and intelligent algorithms will eventually be able to untangle such problems, but I am skeptical. The massiveness of the digital archival puzzle will increase at an ever greater rate- and there are theoretical restrictions on how much better untangling algorithms can become.

In much the same way that it is unlikely that scientists will ever come to the final conclusion of the process of empirical inquiry, attaining perfect knowledge of the universe, it is also unlikely that complex, deeply interactive systems will ever be documented to the point of perfect completion by "metadata".

"Unlikely" does not mean "impossible". Many computer scientists have attempted to find a general solution to the problem. I've spent years on it myself. I think of it as the Everest of computer science.

Another approach is to archive the interactive hardware/software environments, but it's easy to underestimate the difficulty of doing this. It's expensive- you have to buy the machines at their peak value and devote floor space to them. Even worse, you'd have to keep copies of every hardware/software COMBINATION in working order but without changing a single detail.

So, what to do? The usual idea of how to keep history from disappearing is to maintain archives, but what good is that if the information becomes unintelligible? My thought is to keep the global archive of digital information in artificial perpetual use. I'm usually a skeptic of artificial intelligence applications, but here is one that makes sense to me. I thought about this originally after considering the impressive persistence of the Hebrew Torah, which is read and copied constantly, not merely on an as-needed basis. Every document, even ones that haven't been seen by human eyes in five hundred years, can be kept alive by being constantly exercised by autonomous programs.

The result would be a continuous, verified match of works to their environments. In some cases the environment would be kept more stable than it otherwise would have been. This is what Apple does with the Macintosh- testing a large number of applications on each incremental change to the hardware and software. In other cases, the interactive works could be maintained to keep working in a changing environment, though of course that will damage their archival authenticity. And perhaps we will get good enough at computer science to have some of the maintenance take place automatically.
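
The loop of "constant exercise" described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not a design for a real archive: the names Work, is_intact and exercise_archive are hypothetical, and each "environment" is stood in for by a plain function that reports whether a work still runs in it. In practice that function would have to launch the real hardware or an emulator, which is the hard part.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Work:
    """A hypothetical archived work: its bits plus the digest recorded at deposit."""
    name: str
    payload: bytes
    expected_sha256: str


def is_intact(work: Work) -> bool:
    """Fixity check: do the stored bits still match the recorded digest?"""
    return hashlib.sha256(work.payload).hexdigest() == work.expected_sha256


def exercise_archive(
    works: List[Work],
    environments: Dict[str, Callable[[Work], bool]],
) -> List[str]:
    """Exercise every work in every environment and list each failed match.

    `environments` maps an environment name to a stand-in check (an assumption
    of this sketch) that returns True if the work still runs there.
    """
    problems = []
    for work in works:
        if not is_intact(work):
            problems.append(f"{work.name}: bits have changed")
            continue  # no point running a work whose bits are damaged
        for env_name, can_run in environments.items():
            if not can_run(work):
                problems.append(f"{work.name}: no longer works in {env_name}")
    return problems
```

An autonomous program would call exercise_archive on a schedule, whether or not any human had asked for the work, and treat each reported problem as a prompt either to hold the environment stable or to maintain the work.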

I fear a world in which older books that haven't been digitized become invisible, and newer interactive works also become lost due to environmental drift. The Clock Library Project and the Getty Museum are working together to find solutions to these problems. These and other similar projects are vital; nothing robs us of meaning and wisdom more than the loss of collective memory.
