Geo-referenced Digital Libraries: Experienced Problems of Purpose and Infrastructure
Computing Science Department, University of Aberdeen, King's College, Aberdeen, Scotland
In a panel discussion at the 5th European Conference of Digital Libraries, Freeston (2001, p. 458) challenged, "what's holding up the development of geo-referenced DLs [digital libraries] in advancing beyond collections of digital maps?" Is it, he continued,
In the same year the Scottish Higher Education Funding Council (SHEFC) awarded £688,000 to the Aberdeen Geo-referenced Digital Library Project (Aberdeen-ADL) in Scotland (SHEFC, 2001), leading us to collaborate with the Alexandria Digital Library (ADL) in the US.
Two years on, what is now holding up development in the Aberdeen-ADL project? Research collaboration is critical in coping with the complexity of the development. Indeed, projects are often shaped by key funding initiatives. Over the past ten years digital library (DL) programs have been actively pursued in the US and the UK. As Rusbridge (1998, p. 1) explains, projects in the two countries have been very different:
By contrast, the eLib program (funded by the UK's Joint Information Systems Committee [JISC]) characterized itself right from the start as "development" rather than research. JISC does not fund research in the same way the National Science Foundation (NSF) does (or, for example, the UK Research Councils do). Rather, the mission of the JISC is to stimulate and enable the cost effective exploitation of information systems and to provide a high quality national network infrastructure for the UK higher education and research councils communities.
More recently, such discrepancies seem to remain pervasive. For example, Greenstein (2002) surveyed the biography of the digital library in the US, and the NSDL (2003) intends to grow into the world's largest digital library of science, technology, engineering, and mathematics resources and services for education. However, this paper does not address how to establish a unified framework in the interest of more collaboration, which many individual researchers have shown to be absolutely crucial; see, e.g., Peterson (2001), Parry (2003), Smiths (2000), Poland (2000), and Bunker (1999), to name just a few.
Rather, the author's experience warrants great concern that a strategic need alone is insufficient justification to launch a project. We have a problem of purpose, as Levy (2000) and Williams (1988) would say, and a problem of infrastructural commitment, namely of having a plan that enables us to foresee a reasonable likelihood of resources to establish a scalable infrastructure, so that the research can be continued.
We introduce the project in question in section 2. Before reporting the lessons learned, we explain a few infrastructural terms from the computing perspective in section 3. Section 4 then shows that ADL is essentially concerned with infrastructural problems and currently holds none of the content of the maps collected, whereas Aberdeen-ADL is based on strong research collaborations in the areas of large databases and spatial image processing: we are concerned with the contents and their uses rather than the infrastructures, as infrastructures are not something we traditionally study. We highlight our current and future work in section 5, and give conclusions in section 6.
Although it is rare to report a project negatively, reporting lessons from a failure should be encouraged, whether from an author's own efforts or observations, in just as reflective and honest a manner as we might report success. The author believes that, as much as one wants to report and hear about success, we often learn more from our failures. Knowing what not to do, and how to spot the warning signs when things begin to go wrong, can be vital skills. Lessons from failure can also provide greater opportunities for learning and for extracting knowledge from failed experiences.
In May 2000 SHEFC funded the Aberdeen Digital Library Project (Aberdeen-ADL) under a strategy enabling us to collaborate with the Alexandria Digital Library (ADL) and to investigate the various issues surrounding the transfer of digital library technologies from the US for collecting geo-referenced electronic items such as digital maps.
ADL is a well-known system, a product of one of the six digital library projects funded by NSF, DARPA, and NASA in 1995. Its collection and services focus on geographical information: maps, images, geo-referenced data sets with text, and other information sources with links to geographic locations. ADL has since evolved into ADEPT (Alexandria Digital Earth Prototype) in the second phase of development (1999 to 2004), extending ADL's usability into new fields such as classroom-based geo-referenced e-learning applications.
In Aberdeen, Scotland-wide collaborations with our Scottish research partners have been established in the areas of dynamic visualization, archiving methods, and landscape image processing, building on our leading research in very large databases. The overall research strategy focuses on establishing significant international collaboration between Aberdeen (Scotland) and Santa Barbara (California) in the creation and standardization of geo-referenced information sources and services on the Web. We have secured the generous support of Sun Microsystems, who, through a combination of a 50% contribution on top of a 30% academic discount, have effectively offered almost £1M worth of hardware (the terabyte machine) at not much more than a third of the standard price. But why is the strategic vision just described having difficulty launching the project as it was originally proposed? To explain this, we need to review a few technical terms so as to put the argument in the right context.
3. Definition of Terms
3.1. Digital library
With respect to the flow of information attributable to knowledge (Dretske, 1981), the aim of developing a digital library system (Chen, 1998) is no different from the purpose of a traditional library system (Kemp, 1976):
The two, however, clearly differ in their technological origins: one consists foremost of printed matter, originating in printing technology, while the other is composed foremost of items existing only in digital form, originating in computing technology.
For convenience of our discussion, we regard the common set of purposes as library appearances.
A digital library is an organization of digital items providing online services exchanging something knowledgeable about the items for a community.
A digital library is a network of computer-based systems in library appearances; how institutional it appears to be depends on what we want it to be. Currently, most recent advances in digital libraries mimic traditional library appearances, which has provided many new opportunities as well as problems in information science research (Harter, 1997; Collier, 1997; Campbell, 2000; Keller, 2001; LIBER, 2002), such as research on organizing digitized and archived materials to preserve the cultural experiences, knowledge and treasures that we often find in art galleries, libraries, or museums, or on digital publications through gateways, repositories, etc.
But while huge amounts of observational data are being collected that reveal knowledge in the earth sciences, space sciences, the environment, biology, medicine, etc., whether generated today or accumulated over past centuries, we are also in great need of a network of organizations for them. Do we not call them digital libraries?
3.2. Digital items: they have various technology origins and information orientations
Digital items live in computing environments, are much less structured, and can have different technological origins and informational orientations.
A digital map is such an item. Its technological origin can be a digitizing board, a satellite, a scanner, a sensor, a data model, or a combination of them. Its informational orientation is much broader than that of a printed map.
Producing a printed map is a process of transforming the real world into a map, regarded as the cartographic conceptualization and visualization of reality and also known as cartographic abstraction. It is the physical reality of our world compressed in a symbolic way. Because of the nature of printing technology, a printed map has an artistic character: what is abstracted on a map is like a caricature in which certain features are emphasized and others are not. For example, road maps present themselves differently from city maps.
Similarly, a digitized map is a geo-referenced item, only in digital form. It is full of digital artifacts such as bounding rectangles and complex chains of coordinates representing the complex footprints of maps, datasets, and environmental diagrams, and it is dependent on the computing system. Because of this, a whole range of possibilities has been added for inter-operating a digital map with images, geo-data sets, hyperlinked documents, numerical models, video, audio, and/or CAD drawings, enriching the information embedded in a digital map, which can then be retrieved visually via the map. We recognize that this new information orientation, augmenting digital maps, fundamentally challenges the way we organize maps by means of computing.
3.3. Geo-referenced digital library
Geo-references are defined by positions with respect to an origin in latitude and longitude angles. Point locations on the Earth and other bodies are described with such a spatial reference system.
A geo-referenced digital library is a system organizing the Earth referenced electronic items with its online services for communities.
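To make the notion of a geo-reference concrete, the following is a minimal sketch, not the ADL data model, of an item carrying a latitude/longitude bounding rectangle as its footprint (the kind of rectangle mentioned in section 3.2). The names `BoundingBox`, `GeoItem` and `search` are hypothetical, coordinates are assumed to be decimal degrees, and rectangles crossing the 180° meridian are deliberately not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude rectangle in decimal degrees (hypothetical type)."""
    south: float
    west: float
    north: float
    east: float

    def __post_init__(self):
        if not (-90 <= self.south <= self.north <= 90):
            raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
        if not (-180 <= self.west <= self.east <= 180):
            raise ValueError("longitudes must satisfy -180 <= west <= east <= 180")

    def overlaps(self, other: "BoundingBox") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return not (
            self.east < other.west or other.east < self.west
            or self.north < other.south or other.north < self.south
        )


@dataclass(frozen=True)
class GeoItem:
    """A library item (map, image, data set) together with its footprint."""
    title: str
    footprint: BoundingBox


def search(items, query: BoundingBox):
    """Return the titles of the items whose footprint overlaps the query area."""
    return [item.title for item in items if item.footprint.overlaps(query)]
```

A query rectangle drawn around north-east Scotland would then return the items whose footprints touch that area, whatever their technological origin.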
Current geographical information systems (GIS) handle large amounts of geographical data, usually stored in relational databases. Database vendors have developed special database plug-ins to make the retrieval of geographical data more efficient. Basically, they implement spatial indexing techniques aimed at speeding up spatial query processing. This approach is suitable for spatial queries that select items in certain user-defined areas. As online transaction processing systems evolved into online analytical processing systems to support more complicated analytical tasks, a similar evolution can be expected in geographical information analytical processing. However, a spatial database, even one coupled online with such analytical processing, is not a digital library: it needs to be integrated into the technical infrastructure for the library appearances mentioned at the beginning of this section. What are these technical infrastructures?
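As an illustration of the spatial indexing idea mentioned in this section, here is a minimal sketch of a fixed-grid index. It is not the technique of any particular database plug-in; the class name `GridIndex`, the one-degree default cell size and the `(south, west, north, east)` tuple layout are all assumptions made for the example.

```python
import math
from collections import defaultdict


def _overlaps(a, b):
    """True if two (south, west, north, east) boxes share any point."""
    return not (a[3] < b[1] or b[3] < a[1] or a[2] < b[0] or b[2] < a[0])


class GridIndex:
    """Fixed-grid spatial index over (south, west, north, east) boxes in degrees."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (row, col) -> ids of items touching the cell
        self.boxes = {}                 # item id -> box

    def _cells_for(self, box):
        south, west, north, east = box
        rows = range(math.floor(south / self.cell_size),
                     math.floor(north / self.cell_size) + 1)
        cols = range(math.floor(west / self.cell_size),
                     math.floor(east / self.cell_size) + 1)
        return [(r, c) for r in rows for c in cols]

    def insert(self, item_id, box):
        self.boxes[item_id] = box
        for cell in self._cells_for(box):
            self.cells[cell].add(item_id)

    def query(self, box):
        """Return the sorted ids of items whose box overlaps the query box."""
        candidates = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        # Sharing a cell does not imply overlap, so each candidate is checked exactly.
        return sorted(i for i in candidates if _overlaps(self.boxes[i], box))
```

The index only narrows the search to items that share a grid cell with the query area; the exact overlap test is still applied to those candidates.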
3.4. Digital library infrastructures
Infrastructure seems to be a singularly boring object for scientists to study (Star, 2002). It is often referred to as a list of technical specifications, black boxes, places, wires, plugs, roads, bridges, stations, etc. Infrastructuring is usually seen as engineering work to establish public services and utilities for social communities. Roads, railways, bridges, pipelines, electricity, etc. are instances of social infrastructure. Because of the word's technical sound, people now use the term infrastructure to refer to any substructure or underlying structure of a system, most notably the information superhighway: the global information and communication infrastructure of networks that includes the Internet, the WWW, telephone networks, and cable or satellite communication networks. This is the backbone infrastructure for a digital library, the web-driven and network-centric electronic infrastructure. Not all backbone infrastructural elements contribute directly to a digital library, but some of them need to be studied systematically at an early stage of research and development. The Web service is one such component. A digital library also needs other basic infrastructures: semantics infrastructure, protection infrastructure, preservation infrastructure, user infrastructure, and collaboration infrastructure (see, e.g., Chen, 1998, pp. 97-183), which we shall study elsewhere.
3.5. Web service components
Web services are the latest software components and technologies designed to bring us online for "what we need, when we need it via any device we choose and access." With a web service, you run or interact with an application without having the application on your own machine. This calls for more infrastructural effort on our part. Indeed, to provide Web services for geo-referenced digital libraries, we have more sophisticated end-users and therefore need a set of more dedicated Web services. For now, however, let us look at two simpler services that are sufficient for the purpose of illustration (Queue, 2003).
Table 1. IBM phone book web service
The two scenarios listed in Tables 1 and 2 show how an infrastructure determines the scalability of a given application. The set of facilities that each service provides makes a significant difference to how our research focuses on a resource, and can therefore distract us from our research mission and methodology.
In the IBM case, the classic client-server-DB approach is suited to well-structured data services, while the Sun case calls for standards and open services: more platform portability versus front-end and back-end coupling, more inter-operation between applications versus database mediation, and more resource transformation engines versus data-bit transaction servers.
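The client-server-DB pattern can be illustrated with a small sketch built only on the Python standard library. This is not the IBM service of Table 1, whose details are not reproduced here: the table schema, the single directory entry, the `/lookup` path and the function names are all invented for the example.

```python
import http.client
import json
import sqlite3
import threading
import urllib.parse
from http.server import BaseHTTPRequestHandler, HTTPServer

# Back end (DB): a tiny in-memory table; the schema and the entry are invented.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE phone_book (name TEXT PRIMARY KEY, number TEXT)")
db.execute("INSERT INTO phone_book VALUES (?, ?)", ("Help Desk", "0000"))
db.commit()


class PhoneBookHandler(BaseHTTPRequestHandler):
    """Server: answers GET /lookup?name=... with a small JSON document."""

    def do_GET(self):
        query = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        name = query.get("name", [""])[0]
        row = db.execute(
            "SELECT number FROM phone_book WHERE name = ?", (name,)
        ).fetchone()
        body = json.dumps({"name": name, "number": row[0] if row else None}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server():
    """Start the service on a free local port and return the server object."""
    server = HTTPServer(("127.0.0.1", 0), PhoneBookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def lookup(port, name):
    """Client: needs only the address of the service, not the application or the DB."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/lookup?" + urllib.parse.urlencode({"name": name}))
        return json.load(conn.getresponse())["number"]
    finally:
        conn.close()
```

The client here is coupled to one server and one database schema, which is the kind of front-end and back-end coupling that the more open, standards-based approach tries to loosen.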
Table 2. Sun Microsystems cost-effective web service
4. Lessons Learned
The complexity of a digital library stems not only from the classification of the digital items that we intend to collect, but also from the core system functions (discovering, distributing, indexing, cataloguing, storing, retrieving, etc.) and, furthermore, from the layers of social and technical communication infrastructure and the tiers of distributed software architecture. The infrastructures also have to enable us to adapt to new technologies, new methods and new standards once they become available, as well as to accommodate potential research partners once they are identified. Otherwise software development cannot continue.
There are three types of lessons learned in launching the present project: 1) technical infrastructure lessons, 2) research methodological lessons, 3) project managerial lessons.
4.1. Technical Infrastructural Lessons
As we have very briefly introduced in the previous section, a geo-referenced digital library is typically a large collection of geo-referenced items organized and maintained in digital formats within complex Web based service infrastructures, so an end-user can access and explore the items. ADL has proved to be very intensive in terms of finance, human resources, and technology investment.
ADL has three successive versions:
ADL now requires substantial modifications (Janee, 2002). The problems fall into three categories, S-Category, O-Category, and I-Category, described in Tables 3, 4 and 5 respectively.
Furthermore, the infrastructure does not address holding content at all. We believe that the S-Tags-Semantics-Problem in Table 3 is not entirely responsible for this, because further problems described in Tables 4 and 5 are significantly inter-related with the holding function too.
Table 3. S-Category
Table 4. O-Category
Table 5. I-Category
4.2. Research Methodological Lessons
A lack of system analysis and infrastructure has clearly created the following barriers confronting the potential research and development:
This strategy of technology transfer was pursued simultaneously with the ADL fast prototyping approach at UCSB (University of California at Santa Barbara). The entire Aberdeen-ADL development has been regarded as a matter of choosing a programming language, method, and/or technique; little has been done to treat system infrastructure as an object to study and develop.
Digital libraries have to cope with online raw data, which are semi-structured and file-based. Specifically, geo-referenced electronic data are formatted datasets, file-based and operated instrumentally. The evolution of a geo-referenced scientific data archive has to cope with raw datasets usually acquired or systematically derived from a scientific simulation, modeling environment, or data instruments. In addition to textual information, which has been the primary focus of digital libraries until now, raw scientific data collections should be emphasized as well, for a more direct impact on scientific experimentation (i.e., technology origins introduced in section 3.2).
Not only scientific data, but the scientific processes themselves should become part of digital libraries. In particular, simulation models should be stored in digital libraries and become available through them, either as a commodity or as a service. Scientists should be able to compose these into meaningful scientific workflows, feed them with appropriate data, and run the corresponding experiments, all as part of interacting with a digital library. Thus, the entire spectrum of scientific discovery, from initial conception of ideas, to experimental exploration, to publication of the final results, will be served through digital libraries (i.e., information orientations introduced in section 3.2).
Combinations of text, video, audio, images, structured data, and other forms.
Digital libraries should become able to manage all available forms of information in an integrated fashion to support the needs of their users. So far, much effort has been put into building mono-media digital libraries (text, video, or audio). In the near future significant effort should be devoted to building truly multi-media digital libraries, as very few of the ongoing projects deal with this issue.
4.3. Project Managerial Lessons
Make a reasonable commitment.
A reasonable commitment must be made to carry out the necessary project processes. In 1998, JISC projects were asked to concentrate on the near-market, i.e., practical application, end of the spectrum. Yet even now, in 2003, Sun Microsystems, one of the market leaders in digital libraries for educational applications, has published a white paper that raises 54 questions about digital library development, the first of which is "Do You Really Want to Do This?" (SUN, 2003).
How well technological innovations are adopted or adapted is not just a matter of downloading a copy of a software package and making it run on our machine; it also depends on how well the technologies fit our established academic environment and enable us to undertake further research. Even after following all the necessary processes in launching a project, we often need to re-infrastructure it when there is clear evidence that either the original plan has become too complex to pursue or a vision for the digital library cannot be shared through research collaboration. In any such case, commitment can be shown by discussing the emerging problems openly with project members, or by writing a report to provoke more discussion, even if something has to be reported negatively.
Make an alternative plan.
A project is often managed in a technical, political and financial environment that is less desirable than when the project was designed. This is because the structures in which universities exist are often very fluid. Questions must be asked: "Why all of this interest and activity? Did an urgent research and development problem lead to large amounts of grant funding? Did the availability of grant funding create opportunities for a new research area? Did successful research lead to practical developments? Did practical problems lead to research on solutions? Is digital library research and practice a definable area of interest, or has 'digital library' merely become an umbrella term for a wide array of information and technology projects?" (Borgman, 1999).
An alternative plan must be made to align the existing research competence with whatever resources are available. Such a plan should show that some challenges, issues or tensions have been identified, so as to provoke more research interest that would stimulate the strategic initiatives for which the project has been funded.
5. Future Work
5.1. Current main objectives
The main objective is to establish the basic file-based interoperations among several infrastructural components such as a meta-data standard, a Web server, an application server, a service middleware, a protocol, a portal, a remote client and a peer computer. This is to be accomplished by
The experiment will pay particular attention to the following two impacts:
Other impacts will eventually be considered. They are:
Treating storage as an abstract entity with networked services
Here a storage system is treated not as a physical device, but as an abstract entity whose behavior can be defined in terms of its ability to store and retrieve the file-based electronic objects defined above. This includes meta-data in databases, files in file systems, and files in archives. Various instantiations of a storage service will differ according to their naming conventions and the range of requests that they support; for example, one might support personalizing mechanisms while another might not. They will also differ according to their internal behaviors: a dumb storage system might simply put files on the first available disk, while a smart storage system might use techniques such as striping files for high performance, based on observed access patterns.
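The idea of storage as an abstract entity can be sketched as an interface with interchangeable implementations. The code below is an illustration only, not the Aberdeen-ADL design: the names `StorageService`, `DumbStorage` and `StripingStorage` are hypothetical, disks are simulated by in-memory dictionaries, and the "smart" variant stripes every file round-robin rather than reacting to observed access patterns.

```python
from abc import ABC, abstractmethod


class StorageService(ABC):
    """Storage as an abstract entity: it stores and retrieves named objects."""

    @abstractmethod
    def store(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def retrieve(self, name: str) -> bytes: ...


class DumbStorage(StorageService):
    """Puts each file on the first simulated disk that has room for it."""

    def __init__(self, disk_capacities):
        self.free = list(disk_capacities)              # free bytes per disk
        self.disks = [dict() for _ in disk_capacities]

    def store(self, name, data):
        if any(name in disk for disk in self.disks):
            raise FileExistsError(name)
        for i, free in enumerate(self.free):
            if len(data) <= free:
                self.disks[i][name] = data
                self.free[i] -= len(data)
                return
        raise OSError("no disk has enough free space")

    def retrieve(self, name):
        for disk in self.disks:
            if name in disk:
                return disk[name]
        raise KeyError(name)


class StripingStorage(StorageService):
    """Splits each file into fixed-size stripes spread round-robin over disks."""

    def __init__(self, disk_count, stripe_size=4):
        self.disks = [dict() for _ in range(disk_count)]
        self.stripe_size = stripe_size
        self.stripe_counts = {}                        # file name -> number of stripes

    def store(self, name, data):
        stripes = [data[i:i + self.stripe_size]
                   for i in range(0, len(data), self.stripe_size)]
        for n, stripe in enumerate(stripes):
            self.disks[n % len(self.disks)][(name, n)] = stripe
        self.stripe_counts[name] = len(stripes)

    def retrieve(self, name):
        count = self.stripe_counts[name]               # KeyError if unknown
        return b"".join(self.disks[n % len(self.disks)][(name, n)]
                        for n in range(count))
```

A client written against `StorageService` does not need to know which of the two internal behaviors is in use.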
Storage usages can differ according to their physical architecture, e.g., disk farm vs. hierarchical storage system, with respect to their role in a digital library. For example, fast temporary online storage may be more suitable than archival storage. Storage usages can also differ according to the local policies that govern who is allowed to use them and how files are allocated on local storage. For example, an "exclusive" storage service might guarantee that files are created or deleted, and space is reserved, only as a result of external requests, while in the case of a "shared" resource these guarantees might not hold, although the storage system must still guarantee that files are not deleted while in use.
In digital libraries, not all resources are ephemeral. The issue of long-term persistence for digital libraries is being addressed by the Persistent Archive community. Digital libraries are also concerned with the management of digital items while they are being manipulated and circulated as electronic objects. The associated time scales are measured in years (archives), weeks (disk storage), and hours (network caches), as are the lifetimes of some electronic application objects.
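The guarantee that a shared store never deletes a file while it is in use can be sketched with a simple open count per file. The class and method names below are hypothetical, the store lives in memory, and the sketch is not thread-safe; a real service would need locking and a way to recover from clients that never close a file.

```python
class SharedStorage:
    """Shared store that refuses to delete a file while any user holds it open."""

    def __init__(self):
        self._files = {}          # file name -> contents
        self._open_counts = {}    # file name -> number of current users

    def create(self, name, data):
        self._files[name] = data

    def open(self, name):
        data = self._files[name]  # KeyError if the file does not exist
        self._open_counts[name] = self._open_counts.get(name, 0) + 1
        return data

    def close(self, name):
        if self._open_counts.get(name, 0) == 0:
            raise ValueError(f"{name!r} is not open")
        self._open_counts[name] -= 1

    def delete(self, name):
        if self._open_counts.get(name, 0) > 0:
            raise PermissionError(f"{name!r} is in use")
        del self._files[name]
```

Deletion succeeds only once every user of the file has closed it.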
Basic Networked Storage Functions
Within the context of Aberdeen-ADL it is important to reach consensus on a small set of "standard" storage system behaviors, with each standard behavior being characterized by a set of required and optional capabilities. Regardless of the type of storage system, we can establish some basic capabilities that any storage system should have. These required functions are:
Basic Networked Storage Behavior
From the above, we can identify three specific storage system behaviors.
Disk based storage:
Its behavior captures traditional file-based systems. These systems are characterized by a hierarchical file namespace and essentially constant latency for file access. The following optional behaviors also exist:
The second behavior is similar to that of disk based storage, except for a local operational policy that does not guarantee the longevity of its data. Scratch storage and network-based storage elements (data caches) fall into this category. The behaviors are expected to:
The third behavior is that of an online archival storage system that provides secure remote access to large amounts of data, while also providing greater longevity for its data. Files may or may not be named by a hierarchical storage system. Specialized optional behaviors for these storage systems include:
A request for a data object will require staging of the associated container to a disk cache.
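The staging behavior just described can be sketched as follows. This is an illustration under stated assumptions rather than a description of any existing archive: containers and the disk cache are simulated in memory, the class name `ArchiveWithStaging` is hypothetical, and the least-recently-used eviction rule is our choice for the example.

```python
from collections import OrderedDict


class ArchiveWithStaging:
    """Archive whose objects live in containers that are staged to a disk cache."""

    def __init__(self, containers, cache_slots=2):
        self.containers = containers    # container id -> {object name: bytes}
        self.location = {name: cid
                         for cid, objects in containers.items()
                         for name in objects}
        self.cache = OrderedDict()      # staged containers, least recently used first
        self.cache_slots = cache_slots
        self.stage_count = 0            # number of (slow) staging operations so far

    def _stage(self, cid):
        if cid in self.cache:
            self.cache.move_to_end(cid)           # mark as most recently used
            return
        if len(self.cache) >= self.cache_slots:
            self.cache.popitem(last=False)        # evict the least recently used
        self.cache[cid] = dict(self.containers[cid])   # copy stands in for a tape read
        self.stage_count += 1

    def get(self, name):
        """Return a data object, staging its whole container first if necessary."""
        cid = self.location[name]                 # KeyError if the object is unknown
        self._stage(cid)
        return self.cache[cid][name]
```

Requests for objects in an already staged container are served from the cache, so only the first request to a container pays the staging cost.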
6. Conclusions
A strategic collaboration alone is insufficient justification to launch a geo-referenced digital library project, in particular when one's resources are incompatible with those of one's collaborative counterparts. It may be very difficult to organize and coordinate members' research interests, to face various organizational issues, or to obtain institutional support from item collections or practitioners. But a lack of research commitment and essential technical infrastructure is the real problem in multi-disciplinary research and collaboration.
Digital libraries have to cope with online raw data, which are semi-structured and file-based. Geo-referenced electronic data are formatted datasets and file-based in system operation. The evolution of a geo-referenced scientific data archive has to cope with raw datasets usually acquired or systematically derived from a scientific simulation or modeling environment. A considerable degree of infrastructural transparency between computing system components, and in their integration with users' workplaces, is required.
Research in geo-referenced digital libraries cannot begin with a choice of computing languages, methods, approaches or techniques. We need to question the library appearances, and to ask which infrastructural models, computing abstractions, programming tools, or networking strategies will grant us the desired scalability for a given application, and how.
References
ADL homepage. Alexandria Digital Library: Publications, Research Papers, Current Bibliography.
Borgman, C. L. (1999). "What Are Digital Libraries? Competing Visions," Information Processing and Management, 35.
Bunker, G. & Zick, G. (1999). "Collaboration as a Key to Digital Library Development," D-Lib Magazine, 5(3).
Campbell, A. (2000). "Where Are Map Libraries Heading? Some Route Maps for the Digital Libraries," LIBER Quarterly, the Journal of European Research Libraries 10:489-498.
Chen, S-S. (1998). Digital Libraries: the Life Cycle of Information. BE (Better Earth).
Collier, M. (1997). "Towards a General Theory of the Digital Library." Proc. of the International Symposium on Research, Development and Practice in Digital Libraries: ISDL'97, Japan.
Dretske, F. I. (1981) Knowledge and the Flow of Information. Blackwell.
Freeston, M. & Hill, L. L. (2001). "What's Holding up the Georeferenced DLs." In: P. Constantopoulos & I. T. Sølvberg (Eds.), Research and Advanced Technologies for Digital Libraries, Springer-Verlag, September.
Frew, J., Carver, L., Fisher, C., Goodchild, M., Larsgaard, M., Smith, T., and Zheng, Q. (1995). "The Alexandria Rapid Prototype: Building a Digital Library for Spatial Information." In: ESRI International User Conference, Environmental Systems Research Institute, Palm Springs.
Frew, J., Freeston, M., Freitas, N., Hill, L., Janee, G., Lovette, K., Nideffer, R., Smith, T. and Zheng, Q. (1999). "The Alexandria Digital Library Architecture." International Journal on Digital Libraries, 2 (4): 259-268.
Greenstein, D. & Thorin, S. E. (2002). The Digital Library: a Biography. Washington: Digital Library Federation and Council on Library and Information Resources. 2nd ed., December.
Harter, S. P. (1997). "Scholarly Communication and the Digital Library: Problems and Issues," J. of Digital Information, 04 April.
Janee (2002). Current Architectures & Known Limitations (Sign)
JISC homepage. Joint Information Systems Committee.
Keller, C. P. (2001). "The Map Library's Future." Cartographic Perspectives, 38:73-77.
Kemp, D. A. (1976). The Nature of Knowledge: an Introduction for Librarians. London: Clive Bingley Ltd.
LIBER (2002). Strategies for Survival, Conference of the Journal of European Research Libraries.
NSDL (2003). "Building a 'Memory' for the National Science Digital Library," Online 7(2): Jan. 22.
Parry, R. B. (2003). "Who's Saving the Files? Towards a New Role for Local Map Collections?" LIBER Quarterly, the Journal of European Research Libraries, 13.
Poland, J. (2000). "Cooperative Development of the Digital Library: Identifying and Working with Potential Partners," 66th IFLA Council and General Conference, 13-18 August.
Queue (2003). "Building Web Services," ACM Queue, March.
Schatz, B. R. & Hardin, J. B. (1994). "NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet," Science Magazine, 265(12 August): 895-901.
Smiths, J. (2000). "Can a Map be a Geographic Information Retrieval Tool?" LIBER Quarterly, the Journal of European Research Libraries, 10(10).
Star, S. L. (2002). "Got Infrastructure? How Standards, Categories and Other Aspects of Infrastructure Influence Communication," the 2nd Social Study of IT Workshop at the LSE ICT and Globalization, London, 22-23 April.
SUN (2003). The Digital Library Toolkit, Third Edition, Sun Microsystems, January.
Williams, P. (1988). The American Public Library and the Problem of Purpose. New York: Greenwood Press.