The explosive growth of the worldwide web (WWW) has made it critical for information providers such as the USGS to create well-categorized products that can be easily found. A simple search through existing web search services can turn up tens of thousands of relevant documents. Search services rank these documents, displaying first those most likely to meet the user's needs. Unfortunately, the ranking relies on the correspondence between what was typed into the search form and keywords in the document. Moreover, the search request is often not a precise definition of the user's real needs. We can ensure better results only by allowing more precise queries, which can be answered only with more precisely and consistently structured documents.
Metadata are formally structured documentation of digital data products. They describe the "who, what, when, where, why, and how" of every aspect of the data. Creating metadata is like library cataloging, but a metadata record must also capture more details of the science behind the data. Executive Order 12906 requires Federal agencies to document all new digital geospatial data products using standard metadata; this strategy is intended to meet that requirement as well as to exploit the power of this organizing technology.
An overall Division strategy to implement metadata is given below, followed by more detailed recommendations concerning work-flow, available tools, training, and finally, exchange of metadata and use of the National Geospatial Data Clearinghouse to enhance our own research.
This strategy will require the commitment of some technologically oriented scientists or scientifically literate computer specialists, at least one per Division program or team. No unusual hardware or software acquisition is called for. Initially the online catalog can be situated at a single web site equipped with some sophisticated software; in the future the catalog can be presented from many sites.
Metadata specialists will serve as de facto program- or team-level data managers. They will need two days of intensive training and follow-up support (via the network) from those in the Division who have more experience with metadata. The first day of the two-day training should be attended by those who will review metadata as well as those who will create it.
This scheme will make data product development more consistent Division-wide, which will increase quality and, more importantly, increase awareness and use of our information by our customers.
In April of 1994, President Clinton signed Executive Order 12906 initiating the National Spatial Data Infrastructure (NSDI), a Federal coordination effort to answer the need for quality spatial information. A key element of NSDI is the development of a National Geospatial Data Clearinghouse, a source of descriptions (metadata) of geospatial data that are available to the public. Users can search the metadata held in the Clearinghouse to determine whether geospatial data for a given region exist and whether those data are appropriate for solving their specific problems.
The Clearinghouse is a distributed network of internet sites providing metadata (information about geospatial data) to users in a consistent manner. Its success depends on the overall consistency of the metadata that are made available, because users are expected to evaluate metadata from numerous sources in order to determine which data meet their needs.
The Clearinghouse is especially advantageous to geoscientists, because it helps avoid needless duplication of effort and enhances the quality and usability of new products by providing a way to create consistent documentation even when the data are highly diverse. Often it is easier and less costly to build on existing data than to start from scratch, especially when the quality of the existing data is well documented.
To promote consistency in metadata, the Federal Geographic Data Committee (FGDC), an interagency council charged with coordinating the Federal implementation of NSDI, has produced the Content Standards for Digital Geospatial Metadata (CSDGM). That document provides standard terms describing elements common to most geospatial data, and encourages people who document geospatial data sets to use these terms.
The Content Standards for Digital Geospatial Metadata (hereafter referred to simply as "the standard") not only describes the terms of reference but also specifies the relationships among those terms. The relationships, many of which are hierarchical, are complex, and a formal syntax is provided to specify them.
Because the syntax of the standard is complex and the number of descriptive elements is fairly large (335), creating metadata that conform to the standard is not an easy task. In addition to the problem of assembling the information needed to properly describe the subject data sets, metadata producers must provide the information using the terms given in the standard and arrange the terms following the syntactical rules of the standard. The resulting metadata are formally structured and use standard terms of reference, hence the term "formal metadata" in the title of this document.
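To illustrate the hierarchical syntax (the element names below follow the standard; the values and the indented layout are only an illustrative sketch), a fragment of a metadata record might be written as:

```text
Identification_Information:
  Citation:
    Citation_Information:
      Originator: U.S. Geological Survey
      Publication_Date: 1997
      Title: Example geospatial data set
  Description:
    Abstract: A one-paragraph summary of the data set.
    Purpose: The reason the data set was created.
```

Compound elements such as Citation contain only other elements; the descriptive values themselves appear at the leaves of the hierarchy.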
Data that are not spatially referenced in any way do not fall under the requirements of this executive order, so there is no direct requirement to use the FGDC standard to document them. There may be practical benefits to doing so, however. Consistency is a boon both to users and producers.
EO 12906 specifically requires us to consult the Clearinghouse before collecting new geospatial data, as a potential cost-cutting measure. Consequently, the Division should document data products that are in preparation using brief metadata conforming to the Identification Information and Metadata Reference Information sections of the CSDGM.
The use of the word "products" here implies that digital geospatial data not intended for publication and not subject to release through the Freedom of Information Act (FOIA) need not be formally documented, although it may be advantageous to maintain complete documentation for these data also.
Creation of formal metadata requires time and effort. In carrying out this work, GD staff should first seek to document those data products that are of most importance to the Programs and the public, and for these, staff should first seek to document the Identification Information and Metadata Reference Information. Those two sections carry the most basic information, answering "who, what, when, where, why, and how" about the data set.
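As a sketch of this minimum level of documentation, a brief record might contain only the two sections named above. The section and element names are from the CSDGM; the parenthetical notes and values shown are hypothetical placeholders, not prescribed content:

```text
Identification_Information:
  Citation: (who produced the data, when, and under what title)
  Description: (abstract and purpose)
  Time_Period_of_Content: (the time period the data represent)
  Status: (complete or in progress)
  Spatial_Domain: (bounding coordinates)
  Keywords: (theme and place terms)
  Access_Constraints: none
  Use_Constraints: none
Metadata_Reference_Information:
  Metadata_Date: (date the record was written)
  Metadata_Contact: (the data manager responsible for the record)
  Metadata_Standard_Name: Content Standards for Digital Geospatial Metadata
  Metadata_Standard_Version: 19940608
```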
This minimum level of documentation is not satisfactory in the long term. Full metadata describe details of source data, processing, accuracy, completeness, and consistency that are needed for any substantive use of the products, but much will be gained by first providing abbreviated documentation of our best and most important products.
There are three issues to be resolved: work-flow, tools, and training.
We should not assume that data producers (scientific authors and GIS specialists) have to create formal metadata. For many of these people this would not be a wise use of their time, because they typically produce no more than one or a few digital data products per year. Instead, formal metadata should be produced by data managers specifically charged with the task of facilitating the production and use of digital products. Authors and GIS specialists must provide information about the data product, of course, but they need not learn the intricacies of the FGDC standard. Metadata can be produced more efficiently by data managers working together with the creators of the data set.
While many Division programs have staff devoted to data management, not all are in positions specifically designated for this function. With creation of formal metadata added to standard data management activities such as database administration, product development, promotion, and customer support, the role of data manager increasingly occupies the full attention of technologically oriented scientists and scientifically literate computer specialists. Increasingly, customers of USGS research results are sophisticated computer users; programs and teams that staff adequately for data management will be more effective in providing critical solutions to meet these users' needs.
Formal metadata can be produced using a word processor, but there are good reasons to choose more specialized tools that understand the structure of the metadata standard and the file formats expected by the Clearinghouse. Current software developed within GD for creating formal metadata runs under Unix systems. This is economical, because at the moment many research databases in the Division are created and maintained on Unix systems. While it is in principle possible to port these applications to Microsoft Windows or Macintosh environments, the costs would need to be carefully weighed against the benefits.
BRD has created a tool to work under MS-Windows, but it does not yet conform to file encoding guidelines issued by FGDC.
ARC/INFO users may work with DOCUMENT, an AML-based extension developed by WRD and EPA. This tool has important shortcomings that limit its usefulness within GD. Some work by experienced GIS specialists might ameliorate the technical shortcomings, but the software also presumes a work-flow that is not ideal: the GIS specialist must produce the metadata. Many GIS users, especially novices, struggle mightily with ARC/INFO itself, and adding complex formal metadata to their list of duties may not be practical.
A general introduction to formal metadata can be conveyed in a few hours and FGDC staff are available to supply this sort of background information. More detailed training specifically intended to develop real facility will take considerably longer. Two-day workshops, one in each major center, should serve this purpose if the attendees have a proper understanding of the need for this activity.
Day 1, morning: half-day presentation and discussion with GD examples.
Day 1, afternoon: half-day presentation and discussion with GD examples.
Day 2, morning: half-day work session to create metadata for users' data.
Day 2, afternoon: half-day presentation and hands-on exercises culminating in creation of a working Clearinghouse site serving attendees' metadata records.
To attain high quality and timeliness, reviewers of metadata should understand metadata themselves. To develop the needed expertise, representatives of the publications groups should attend at least the first day of the two-day workshops described above. Their work will be better and easier if they attend both days.
SGML stands for Standard Generalized Markup Language, a way of specifying the structure of a textual document. FGDC's use of SGML is relatively simple and is intended to move metadata tools toward open standards for file encoding.
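As an illustration of the encoding (the short tag names shown are those defined in the FGDC DTD; the content is hypothetical), the beginning of a record in SGML form looks something like:

```text
<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <origin>U.S. Geological Survey</origin>
        <pubdate>1997</pubdate>
        <title>Example geospatial data set</title>
      </citeinfo>
    </citation>
  </idinfo>
</metadata>
```

The tags mirror the hierarchy of the standard. Tools such as the USGS parser mp generate this encoding automatically from the indented text form, so metadata producers need not write SGML by hand.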
By special arrangement with a Clearinghouse node, metadata may be transferred using the indented text format accepted by the USGS metadata parser mp, provided the metadata are not missing any required elements. mp converts proper metadata to SGML, HTML, plain text, and NASA's Directory Interchange Format (DIF).
The Division should create an appropriate link to its Clearinghouse node on its public and internal web pages. The Division's node will be registered with FGDC so that it will be accessible from the National Geospatial Data Clearinghouse.