Library Philosophy and Practice Vol. 7, No. 1 (Fall 2004)

ISSN 1522-0222

Studying the Reader/Researcher Without the Artifact: Digital Problems in the Future History of Books

Dorothy Warner

Associate Professor, Librarian

John Buschman

Professor, Librarian

Rider University Libraries
Rider University
Lawrenceville, New Jersey 08648

(This is an edited and updated version of a paper given at the Society for the History of Authorship, Reading, and Publishing {SHARP} 2002 Conference, University of London, July 11th)


It is salient to begin this article with some examples of fertile and groundbreaking study emanating from the history of the book, reading, and publishing:

  • Robert Darnton brilliantly re-constructed the world-view of 18th century French society from the ground up in his book The Great Cat Massacre. He did so by re-interpreting odd and rare documents such as a printing society’s wage book, a semi-fictional autobiography of a printshop worker, and an odd, obsessively complete “inventory” of the city of Montpellier.[1]

  • Justin Kaplan’s notes in his Library of America edition of Whitman’s Leaves of Grass list eight different editions Whitman produced and edited, the first consisting of twelve poems and a preface, others expanding to four times the length, and then contracting again. Like all of Whitman’s later compilers and editors, Kaplan had to weigh the author’s injunctions, declared at various times, about the variety of editions in order to come up with a complete or definitive edition.[2]
  • Wayne Wiegand has studied odd documents of library history like Library Bureau accession/de-accessioning books used in most small American public libraries to record the acquisition of books. Wiegand productively studied the censorship of controversial materials in some of those libraries over a 66-year period using these records.[3]
  • Jonathan Katz[4] and Martin Duberman[5] are scholars who have researched and documented the history of the gay experience in America. Over the course of 25 years, they have examined previously unpublished and overlooked documents discovered through various means: by communication with gay people; by following up on rumor and vaguely remembered diaries and papers; by following obscure trails left in footnotes, much of which was located in privately-owned and only-recently gathered library archival collections.

What do these examples have in common? They represent important and interesting work that could be accomplished because the documents and the publications exist, and they exist primarily because they were printed and reprinted, simply kept somewhere, preserved and archived. The study of reading, books, book production, editing, and the research process posits a very simple assumption: that which has been read, edited, absorbed, used and studied will still exist as an artifact. As Ronald Schuchard wrote, “what interests the scholar ... in the archive [is] the preservation and accessibility of the materials of the creative imagination, the physical materials, including all the detritus, debris, and ephemera of art, biography, history. And the archival preservation of these materials is crucial for the minor as for the major figures of a literary generation”[6] - the very authors, as Michael Winship[7] points out, that most people read the most, after all.

However, the trend toward digitization, promoted by those who want information available instantly and in a “more accessible” format, poses a very fundamental challenge to the essential assumption that those items will exist in the future. The dramatic move to exclusive web-distribution of federal and state government information and data in the United States is a good case study of this problem. Essentially, this project has been undertaken without planning or budgeting for archived, permanent and secure (that is, unaltered) access. A front page story in the New York Times detailed the digitization project in the US Patent Office of 18th and 19th century patents - and the discarding of the original documents. One person did some dumpster diving outside the Office and came up with four original application copies of some of Thomas Edison’s patents.[8] Much of the newly-digitized data is the raw material for scholars in such far-flung subjects as law, the environment, education, demography, and of course economics and business. Data and documents are in danger not only in governmental sources, but in private databases as well. Significant numbers of novels, scientific journals, and publishing records - economic and editorial - to give only a few examples, are now extant largely or exclusively in digital form.


Our profession’s policies note specifically the “threat to information posed by technical obsolescence, the long-term retention of information resident in commercial databases, and the security of library and commercial databases.”[9] However, in the haste to make information available electronically there are few agreed-upon plans for the preservation of digital information and much has already been lost. For example:

  • Most of the data from the Viking mission to Mars no longer exists.[10]

  • The Division of Elections in New Jersey eliminated the web page that gave the previous year’s election lists and results. Concern from those using the information prompted the Division of Elections to begin to retain this information, but the earlier information is gone. Another New Jersey agency created a new web page and eliminated virtually all of the documents that had existed on the earlier page.[11]

  • When the National Archives received data from the Census Bureau in the mid-seventies, it was in a UNIVAC format that had been state-of-the-art in the 1960s. At the time “there were only two UNIVAC computers left in the world: one in Japan and the other housed in the Smithsonian Institute as a museum piece. Heroic and costly rescue efforts recovered much, but not all, of the data.”[12]

  • The computerized data from a New York study mapping land use and environmental data throughout the state was lost. “The study had employed customized computer software that no longer existed when the computer tapes were turned over to the New York State Archives.”[13]

  • With the inauguration of George W. Bush, the White House website was completely changed and all of the Clinton administration’s web collection disappeared overnight. Fortunately, the National Archives and Records Administration (NARA) had begun to preserve the content of the Clinton administration’s contributions to the White House website, although some suspect that information has been lost anyway, since it has been reported that agencies in the Executive Branch were not all successful in complying with NARA preservation requests.[14]
  • “Some historically valuable records may be deleted prematurely. The New Jersey state Department of Labor ... maintains a database of accounting information on each employer’s payroll. Since the department needs the data primarily for enforcing employer contributions of ... taxes, it offloads records seven years after an employer has ceased operating. But historians might well want to use these datafiles for researching patterns of ethnic and gender employment... for example.”[15]

  • More recently, as a result of September 11, there have been requests by the Federal government to destroy specific information deemed as “potentially sensitive,” and in one instance librarians questioned the order to destroy a public water supply database. The CD-ROM was “compiled to help those researching improvements in water supply safety [and] while it contained no analysis of system vulnerabilities, it documented locations of such crucial infrastructure as intake pipes.... Of primary concern is that there may be no way to retrieve electronic documents that are destroyed.”[16]
  • In the 1980s, NARA transferred about 200,000 images and documents on to optical disks - again the state-of-the-art technology of that moment. “[T]he half-life of most computer technology is between three and five years” and it is no longer certain that the disks can still be played because they depend on computer software and hardware that are no longer on the market, according to a NARA specialist.[17]

  • “All federal agencies must now preserve computer files and electronic mail. But it took the Archives two and a half years (and its entire electronic-records staff) just to copy the electronic records of the Reagan White House, [and] they are gibberish as they currently stand,” according to Fynette Eaton, who worked at NARA’s electronic-records center.[18]

  • A problem with preservation of e-mails is that the e-mail programs “were not written with long-term storage in mind. So, in the current state of technology, the Archives computers must treat each individual e-mail message as a separate file, which has to be opened and closed in order to be copied from one tape to another.”[19]

Problems Beyond Government Information

Nor are these problems limited to government information. The preservation of electronic journals is also a concern for libraries. Wiggins notes the irony of the demise of the Committee on Institutional Cooperation’s CICNet Journal Archive due to lack of funding. For six years, from 1991 to 1997, the group attempted to archive electronic journals. The archive has vanished. “Ironic, indeed, to lose not a mere collection but an archive whose purpose was to prevent loss of electronic content. How many pioneering e-journals, many of them hosted on now defunct Gopher servers, were lost for eternity?”[20] In a related issue, an attempt to obtain an article beginning on page 415 of a scientific journal revealed that the online version, available via Science Direct, showed articles in that volume only up to page 389. The response to a query to Science Direct was that at least 2% of its electronic journal content is missing.[21]

Winship observed over a decade ago a need to identify, locate, and interpret the primary sources for publishing history.

[T]here has not yet been a systematic attempt to uncover and make available the basic resources.... In America ... we have been very profligate with such material. Very few publishing firms that existed one hundred years ago are still in business today, and it seems safe to say that even fewer publishing archives or records survive from that or earlier periods. This situation makes it imperative that we locate those records [and] we will need to make sure that these sources are preserved for the future....[22]

Given the subsequent media monopolies which control global publishing that Schiffrin[23] and Miller[24] have identified, the preservation of current electronic publishing files, e-mails, and electronic editing, and in some cases digital publishing seems very much in doubt for future scholars of our current literature.

It is clear that we are rushing ahead before we are ready. A Senior Vice President at Elsevier who is an original member of the Task Force on Archiving of Digital Information convened by the Research Libraries Group and the Commission on Preservation and Access in 1994 states that “there is no magic bullet in electronic archiving. Those of us who are spending large chunks of our professional time on the topic know that it will require a lot of trust and good-faith effort to continue to move things forward. It is too important and too expensive to be left to chance.”[25] Another expert is troubled by the suggestion that a magic bullet solution (“a simple, universally applicable, one-time fix”) has even been proposed.[26] Moreover, there is no overall plan for archiving federal government documents that exist only in digital format. Instead each agency determines its own preservation policy. A representative from the Bureau of Labor Statistics (BLS) recently promised a conference audience that all digital information at the BLS would be preserved forever, but will Congress adequately fund BLS to be able to follow through on this guarantee? The Government Printing Office (GPO) has had significant budget cuts at the same time that Congress has given GPO the mandate to cut printing costs by making information available digitally. This, of course, does offer wider access to the information today, but what about tomorrow?

The rush to make information available quickly and widely, often for “future planning” purposes, has overshadowed the need to ensure that the very same information will continue to be available for planners, literary scholars, and historians of the future. The cart is again before the horse in several areas which we will now discuss in brief: standards, costs, digital preservation strategies, reading mechanisms, and the context of digitally preserved information.


Standards

There is a vigorous debate over technological and software standards since “no computer technical standards have yet shown any likelihood of lasting forever.”[27] This is an important area since standards “can assist by facilitating the transfer of information between hardware and software platforms as technologies evolve” and “resources which are encoded using open standards have a greater chance of remaining accessible after an extended period than resources encoded with proprietary standards.” Descriptive metadata has no agreed-upon standard. Metadata is defined as: “data about data or information known about the image in order to provide access to the image. This usually includes information about the intellectual content of the image, digital representation data, and security or rights management information.”[28] Typical metadata standards are US MARC and the emerging scheme, Dublin Core. Research is being conducted to attempt to develop a uniform standard which must exist for any of the electronic preservation models to succeed.[29]
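To make the definition of metadata above more concrete, the following is a minimal sketch, in Python, of a descriptive record for a digitized page image. The keys are element names from the Dublin Core element set (title, creator, date, format, identifier, rights). Everything else is an assumption made for illustration: the sample values are invented, and the REQUIRED list and the missing_elements helper are hypothetical, not part of any standard or of the projects cited here.

```python
# Element names come from the Dublin Core element set; the values are
# invented placeholders for illustration only.
record = {
    "title": "Sample scanned page",   # intellectual content
    "creator": "Unknown",
    "date": "1890",
    "format": "image/tiff",           # digital representation
    "identifier": "example-0001",
    "rights": "Public domain",        # rights management
}

# Hypothetical local policy: elements an archive might insist on.
REQUIRED = ("title", "date", "format", "identifier", "rights")


def missing_elements(rec):
    """Return the required element names that are absent or empty."""
    return [name for name in REQUIRED if not rec.get(name)]


# Lists whatever a cataloguer would still need to supply.
print(missing_elements(record))
```

A record of this kind is only useful over time if later systems interpret the same element names in the same way, which is why the lack of an agreed-upon standard matters.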


Costs

Cost considerations are substantial.

One clear message that has emerged is that a great deal of money can be wasted if digitization projects are undertaken without due regard to long-term preservation. It is now relatively easy to produce digital versions of texts or images. However, if there is no plan in place for archiving the digital files, long-term preservation will be expensive, or may even result in the work having to be repeated.[30]

Project Open Book, at the Yale University Libraries, studied the costs of converting the printed text and accompanying materials in 10,000 brittle books into digital images.

[I]nvestigators expected to find that both digital storage and access costs would be cheaper than the costs of storage and access in a traditional paper-based library. However, the results of the study showed that unit costs for storage were more than 12 times higher, and for access 50% higher in the digital archive than in the traditional library. These results were true in the first year of operation and continued to be true for storage costs, though to a lesser degree projected over ten years, even when staff and overhead costs for the traditional library were taken into consideration.[31]

Digital Preservation Strategies

In international discussions regarding archiving issues there is a presumption that for online journals, migration will be the digital preservation methodology of choice. Migration is defined as the “periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation.”[32] For example, the information on a floppy disk may be transferred to a CD-ROM format, offering only a temporary preservation since the CD-ROM format must then be migrated when the technology changes again. However, a great number of questions still need to be answered and “until those questions are resolved, libraries will be understandably reluctant to make a permanent switch from paper to electronic collections. What should be archived and in what format? How many copies of the archive are needed? Who holds those copies? What is the access to the archive and who controls that access? How does licensing affect archive building? What can the scholarly community afford?”[33]

The digital information must be refreshed without changing it, yet in a new operating environment the copy is not exactly the same as the original, which requires decisions about the aspects that need to be preserved. Metadata can assist here in providing information about migrations and the effect on the digital object. In some cases, software that is “backwards compatible” can simplify the migration process (the most recent version of the software having the capability of decoding the files created in the earlier version). However, there is no guarantee of compatibility over time as technological developments become increasingly complex and/or it is no longer financially worthwhile for a software manufacturer to support such compatibilities. Some question the practicality of migration while some point out that each new format will require a unique solution. The most extreme (and ironic) version of this is preservation on paper or preservation-quality microfilm. It is worth noting the obvious again: an archival-quality paper or microfilm record can last up to 500 years.[34] However, the disadvantage of preserving a digital record on print or microfilm is that the record may not adequately represent the original object, since the digital functionality of the resource can be destroyed: computation capabilities, graphic display or indexing, and equations embedded in a spreadsheet are lost, and it is impossible to print out an interactive full-motion video or to preserve a multimedia document as a “flat file”. Data loss and the loss of functionality or of the “look and feel” of the original platform also remain concerns with the migration method.
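The requirement that refreshed information remain unchanged, and the role of metadata in documenting each transfer, can be sketched with a checksum comparison. The Python below is an illustration under stated assumptions, not a description of any archive's actual procedure: the function names (fingerprint, refresh), the log format, and the sample text are all hypothetical, and the "copy" is made in memory as a stand-in for writing to a new storage medium. Only the SHA-256 routine from Python's standard hashlib module is real.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def refresh(data: bytes, log: list, note: str) -> bytes:
    """Copy the bytes (a stand-in for a new carrier) and log a fixity check."""
    copy = bytes(data)
    before, after = fingerprint(data), fingerprint(copy)
    log.append({
        "event": note,
        "sha256_before": before,
        "sha256_after": after,
        "unchanged": before == after,
    })
    return copy


original = b"Election results, previous year"  # invented sample content
history = []  # hypothetical preservation metadata kept beside the object
copy = refresh(original, history, "refresh: floppy disk to CD-ROM")

# A silent one-word substitution produces a different fingerprint.
altered = copy.replace(b"previous", b"current")

print(history[0]["unchanged"])
print(fingerprint(altered) == fingerprint(original))
```

A checksum of this kind can only confirm that a refreshed copy is bit-for-bit identical. A true migration to a new format changes the bits by design, so the accompanying metadata would have to describe what was transformed and what was lost, which is the harder problem.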

Reading Mechanisms

Clifford Stoll has described one of the other primary problems previously alluded to: “electronic media aren't archival [and] the physical medium isn't the problem. It's the reading mechanism.” He goes on to give many examples of the now-extinct formats and the machines that read them: 78-rpm records, 8-track tapes, 100-column punch cards, and 5-inch glass lantern slides. Further, there is an equally impressive list of soon-to-disappear formats and readers like Betamax tapes, and single-side, single density diskettes. As Stoll notes, the information contained in these formats may be perfectly good and workable, “but they become increasingly expensive to read, as equipment becomes expensive to maintain or simply cannot be repaired.”[35] Libraries and archives all over are slipping and sliding toward exactly this problem: the replication of the information into a more current format is very expensive and this promises to further strain library budgets - exactly what the National Archives faced in converting UNIVAC-stored Census information. Because of concern about potential technological obsolescence, a substantial amount of printing of electronic government documents (both state and federal), some as long as 500 pages, is taking place, both by libraries and by end-users. Under such a regime, further costs are transferred to libraries and archives.

Context of Digitally Preserved Information

Kenneth Thibodeau of the National Archives expresses concern on behalf of future researchers about current digital preservation methods. The Archives’ responsibility is to “preserve and deliver authentic records to subsequent generations of users.” A connection needs to exist between a historical record and the activities in which it is made and received. If this link is broken, corrupted, or even obscured, the information in the record may be preserved, but the record itself is lost. This fundamental difference between records and documents can be readily illustrated empirically. For example, a map of Sarajevo is a document, but a map of Sarajevo known to have been used in making a targeting decision that led to the bombing of the Chinese Embassy is an essential record of that action. The key difference between the document and the record is the specification of the context of action in which the record was involved. To preserve authentic records entails preserving the documents themselves and also their connections to the activities in which they were used.[36]


To conclude, our profession expresses bedrock principles that have become fundamental to our concept of reading and research:

Now as always in our history, books are among our greatest instruments of freedom [and] they are essential to the extended discussion which serious thought requires, and to the accumulation of knowledge and ideas in organized collections. [F]ree communication is essential to the preservation of a free society and a creative culture [and] the range and variety of inquiry and expression on which our democracy and our culture depend.... [T]he preservation of library resources is essential to protect the public's right to the free flow of information. [37]

As Wiegand notes, our admittedly biased and flawed classification schemes devised over centuries still “constitute one of the few bridges available to all who use them to help link the separate islands of discourse.... What we do constitutes [an inherent] challenge to that power when we facilitate access by organizing information.... Capitalism doesn’t necessarily appreciate this; democracy does.”[38] It is not enough to collect and save this output; we must make it available to people, to researchers, and to the future.

That legacy is in some danger. A chilling report from a division of the American Library Association in 1977 stated that

As a consequence of . . . information overload, the role of libraries for several thousand years, which emphasizes the preservation of the human record, has now become more complex, requiring hard decisions not only about what is to be preserved but also about what is to be discarded. Decisions are, and must, be made to erase portions of the record deemed to be insignificant, irrelevant, and unrepresentative, in order that the useful and pertinent be accessible.[39]

Perhaps most famously, Nicholson Baker has blown the whistle on wholesale dumping of collections in the building of the new San Francisco Public Library, the disregard for the valuable and irreplaceable information (like usage, provenance if the item was a gift, and notations) contained in the discarded Harvard University Library (and other research library) catalog cards, and of course the dumping of the last copies of original 19th and early 20th century American newspapers. Baker has charged - credibly - that U.S. libraries have “abandoned their duty” to preservation.[40] Our profession’s uncritical, unthinking enthusiasm for technologies has led us to overlook significant problems with electronic resources with regard to preservation.

The problem was stated by O’Mahony, whose specific concern was about electronic government information, but that concern certainly relates to other forms of digital information:

Each day that the problems of electronic preservation and permanent public access go unresolved, alarming amounts of government information continue to be lost as databases come and go from agency websites, files are deleted from government computer servers, digital storage media deteriorate, and hardware and software become obsolete. The continuous and cumulative effects of this ongoing catastrophe are to ... impair the public’s ability to use government information already collected and compiled, to waste public and private resources in having to duplicate efforts to retrieve information previously available but now lost, and to allow the historical record of the nation to literally vanish before our eyes. Moreover, it severely undermines the potential promise and usefulness of new electronic technologies when the long-term consequence of their use is an ever-widening breach in our collected knowledge and information bank.[41]

There are fundamental issues at stake for libraries and digitized archives. A true archive “shouldn't depend on duplication for preservation.”[42] While expressing gratitude to libraries for digital and microfilming preservation efforts, the Modern Language Association states that “the advantages of the new forms . . . cannot fully substitute for the actual physical objects in which those earlier texts were embodied at particular times in the past . . . . All objects purporting to present the same text . . . all carry different information, even if the words and punctuation are identical....”[43] Eugene Provenzo writes that “anyone who has used a word-processing system . . . knows how easy it is to transform information in a digital context. One word can be automatically substituted for another, a name changed, a date altered, an idea corrupted without any record of what the original source said. [This] represents a major problem in terms of the integrity of historical documents, and the extent to which we can trust the information from such sources in the future.”[44]

One of the great ironies of the information age is that, while the late twentieth century will undoubtedly record more data than have been recorded at any other time in history, it will also almost certainly lose more information than has been lost in any previous era. A study done in 1996 by the Archives concluded that at current staff levels it would take approximately a hundred and twenty years to transfer the backlog of nontextual material (photographs, videos, film, audiotape, and microfilm) onto a more stable format.... There also appears to be a direct relationship between the newness of a technology and its fragility.... A librarian at Yale University has created a graph going back to ancient Mesopotamia which shows that, while the quantity of information being saved has increased exponentially, the durability of media has decreased almost as rapidly.[45]

Consider once more the example of researching the American gay experience noted at the beginning of this paper. Personal communication and footnotes pointed toward both private and library archival collections, but if they existed originally in electronic form, where would they be today? Would an individual, organization, library or archive have taken the time to archive them, given the costs of constantly upgrading the archive to the newest digital format? And, even if this had been attempted, how would a researcher discover them? As researchers today persist in leafing through often disorganized boxes of print collections in an archive searching for clues, where would a researcher locate something perhaps considered to be ephemera at the time of its inception, yet an invaluable clue for a later historian? A colleague notes that mid-20th century hymn collections are less likely to be found in library collections than 17th century volumes. “In a century known for the ‘information explosion,’ when new technologies revolutionized printing, perhaps ephemera can only be valued in hindsight.”[46] Likely, no indexing to ephemera would exist, and most likely this particular documentation of gay or sacred music history would be invisible to the researcher if it did exist in electronic form. It may even have been deleted from electronic existence many years before. If the researcher is willing to take the time to locate information stored in digital form and access it in the particular electronic state that it is in, at what cost of time is the researcher missing the “opportunities for study and careful concentration” of the information discovered? One scholar suggests that “time devoted to finding comes at the expense of time for reading.”[47]

We are nearing a time when we will bequeath a scholarly record that will be akin to the study of art history only through the descriptions of the critical literature, but without the original artifact. Neil Postman has argued that we have “embarked on a great uncontrolled experiment which involves submitting all of our institutions to the sovereignty of these new media [and they are] winning the competition with typography for the time, attention, and cognitive predispositions” of people.[48] This process of redefinition - driven in large part by electronic resources - is not without serious problems for research, archives, libraries and our concept of research and reading. In the immediate sense, we are gravely concerned that the excitement of mere technical possibility and convenience is undermining the existence of important documentation in the future.


