National Geologic Map Database Project -- U.S. Geological Survey


NOTE: this document contains information on activities of the National Geologic Map Database Project. It is informal, generally time-sensitive material intended for project members, cooperators, and interested parties.


A strategy for creation of formal metadata

This strategy was proposed to the USGS Geologic Division by the Division Information Council. It was developed to help improve metadata-writing skills in the Geologic Division. For state geological surveys and others, it may serve as useful guidance in developing an appropriate strategy.



Executive summary

This document proposes a strategy for making the research results of the Geologic Division more visible, accessible, and relevant to our customers. Creating and supporting an online catalog of high-quality, detailed metadata will make the Geologic Division's digital data products consistent in appearance, better documented, and more widely known. With the right contact information for USGS personnel and a commitment to follow-up support, these metadata will help to make the Division's products indispensable to our customers.

The explosive growth of the World Wide Web (WWW) has made it critical for information providers such as the USGS to create well-categorized products that can be easily found. A simple search through existing web search services can turn up tens of thousands of relevant documents. Search services rank these documents, displaying first those most likely to meet the user's needs. Unfortunately, the ranking relies on the correspondence between what was typed into the search form and key words in the document. Moreover, the search request is often not a precise definition of the user's real needs. We can ensure better results only by allowing more precise queries, which can be answered only with more precisely and consistently structured documents.

Metadata are formally structured documentation of digital data products. They describe the "who, what, when, where, why, and how" of every aspect of the data. Creating metadata is like library cataloging, except that more details of the science behind the data must be understood and included in the record. Executive Order 12906 requires Federal agencies to document all new digital geospatial data products using standard metadata; this strategy is intended to meet that requirement as well as to exploit the power of this organizing technology.

An overall Division strategy to implement metadata is given below, followed by more detailed recommendations concerning work-flow, available tools, training, and finally, exchange of metadata and use of the National Geospatial Data Clearinghouse to enhance our own research.

This strategy will require the commitment of some technologically-oriented scientists or scientifically-literate computer specialists, at least one per Division program or team. No unusual hardware or software acquisition is called for. Initially the online catalog can be situated at a single web site equipped with some sophisticated software; in the future the catalog can be presented from many sites.

Metadata specialists will serve as de facto program- or team-level data managers. They will need two days of intensive training and follow-up support (via the network) from those in the Division who have more experience with metadata. The first day of the two-day training should be attended by those who will review metadata as well as those who will create it.

This scheme will make data product development more consistent Division-wide, which will increase quality and, more importantly, increase our customers' awareness and use of our information.

Background

Throughout the 1980s and early 1990s, the improving capability of desktop computers to carry out complex analyses increased the popularity of geographic information systems (GIS). As they have become familiar with GIS technology, people at all levels of government, in industry, and in academia have been calling for better access to publicly available geospatial information and more general use of standard terms of reference and of standard formats for the exchange of geospatial data and information.

In April of 1994, President Clinton signed Executive Order 12906 initiating the National Spatial Data Infrastructure (NSDI), a Federal coordination effort to answer the need for quality spatial information. A key element of NSDI is the development of a National Geospatial Data Clearinghouse, a source of descriptions (metadata) of geospatial data that are available to the public. Users can search the metadata held in the Clearinghouse to determine whether geospatial data for a given region exist and whether those data are appropriate for solving their specific problems.
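The kind of regional search described above can be pictured as a comparison of geographic extents. The following is a minimal sketch in Python, not a description of how Clearinghouse software actually evaluates a query. The `BoundingBox` type, the `overlaps` function, and the catalog entries are hypothetical names and values introduced only for illustration, and the sketch assumes that no extent crosses the 180-degree meridian.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Geographic extent in decimal degrees (hypothetical helper type)."""
    west: float
    east: float
    south: float
    north: float


def overlaps(a: BoundingBox, b: BoundingBox) -> bool:
    """Return True if the two extents share any area or edge.

    Assumes neither box crosses the 180-degree meridian.
    """
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


# Hypothetical catalog mapping data set titles to their extents.
catalog = {
    "Example map A": BoundingBox(-78.0, -77.0, 38.0, 39.0),
    "Example map B": BoundingBox(-120.0, -119.0, 35.0, 36.0),
}

# The user's region of interest.
region = BoundingBox(-77.5, -76.5, 38.5, 39.5)

matches = [title for title, box in catalog.items() if overlaps(box, region)]
print(matches)
```

A search of this kind tells the user only that data exist for the region; the rest of the metadata record is still needed to judge whether those data are appropriate.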

The Clearinghouse is a distributed network of internet sites providing metadata (information about geospatial data) to users in a consistent manner. Its success depends on the overall consistency of the metadata that are made available, because users are expected to evaluate metadata from numerous sources in order to determine which data meet their needs.

The Clearinghouse is especially advantageous to geoscientists, because it helps avoid needless duplication of effort and enhances the quality and usability of new products by providing a way to create consistent documentation even when the data are highly diverse. Often it is easier and less costly to build on existing data than to start from scratch, especially when the quality of the existing data is well documented.

To promote consistency in metadata, the Federal Geographic Data Committee (FGDC), an interagency council charged with coordinating the Federal implementation of NSDI, has produced the Content Standards for Digital Geospatial Metadata (CSDGM). That document provides standard terms describing elements common to most geospatial data, and encourages people who document geospatial data sets to use these terms.

The Content Standards for Digital Geospatial Metadata (hereafter referred to simply as "the standard") not only describes the terms of reference but also specifies the relationships among those terms. The relationships, many of which are hierarchical, are complex, and a formal syntax is provided to specify them.

Because the syntax of the standard is complex and the number of descriptive elements is fairly large (335), creating metadata that conform to the standard is not an easy task. In addition to the problem of assembling the information needed to properly describe the subject data sets, metadata producers must provide the information using the terms given in the standard and arrange the terms following the syntactical rules of the standard. The resulting metadata are formally structured and use standard terms of reference, hence the term "formal metadata" in the title of this document.
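The hierarchy of elements described above can be pictured as a tree in which compound elements contain other elements and only the leaves carry values. The following is a minimal sketch in Python, assuming a record is held as nested dictionaries. The element names follow the style of the standard, but the record contents and the helper `element_paths` are hypothetical and are shown only for illustration.

```python
# Illustrative only: a tiny, hypothetical metadata record.
# Compound elements map to dicts; leaf elements map to text values.
record = {
    "Identification_Information": {
        "Citation": {
            "Citation_Information": {
                "Originator": "Example Author",
                "Title": "Example geologic map data set",
            }
        },
        "Description": {
            "Abstract": "Placeholder abstract.",
            "Purpose": "Placeholder purpose.",
        },
    },
    "Metadata_Reference_Information": {
        "Metadata_Date": "19970101",  # placeholder date
    },
}


def element_paths(node, prefix=()):
    """Yield (path, value) for every leaf element in the record."""
    for name, value in node.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            # Compound element: descend into its children.
            yield from element_paths(value, path)
        else:
            yield path, value


for path, value in element_paths(record):
    print(" > ".join(path), "=", value)
```

Even this small record has leaf elements three and four levels deep, which suggests why arranging several hundred elements by hand is error-prone.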

Premises

  1. The Federal Geographic Data Committee has defined a content standard for digital geospatial metadata. Executive Order 12906 requires Federal agencies to use this content standard to document digital spatial data.

  2. EO 12906 defines "geospatial data" as "information that identifies the geographic location and characteristics of natural or constructed features and boundaries on the Earth. This information may be derived from, among other things, remote sensing, mapping, and surveying technologies. Statistical data may be included in this definition at the discretion of the collecting agency." Clearly most data that the Geologic Division (GD) produces are covered by this definition.

    Data that are not spatially referenced in any way do not fall under the requirements of this executive order, so there is no direct requirement to use the FGDC standard to document them. There may be practical benefits to doing so, however. Consistency is a boon both to users and producers.

  3. Since the signing of EO 12906, Federal and state agencies have been working with FGDC staff to develop methods for creating and exchanging metadata. These methods include organizational strategies, software tools, and specifications for encoding metadata in digital files.

    EO 12906 specifically requires us to consult the Clearinghouse before collecting new geospatial data, as a potential cost-cutting measure. Consequently the Division should document data products that are in preparation using brief metadata conforming to the Identification Information and Metadata Reference Information sections of the CSDGM.

  4. Consistent with the Bureau Strategic Plan, the Geologic Division has among its principal goals ease of access to and use of its geological information in digital form. To reach these goals, the Division must create digital information in formats that are convenient and consistent with similar data products and make the information available to customers using procedures that are convenient and readily understood.

  5. Documentation is considered proper if it is complete, correct, and consistent with documentation of similar products.

Recommendations

Overall strategy

  1. The Geologic Division (GD) should document its digital geospatial data products using the FGDC Content Standards for Digital Geospatial Metadata (CSDGM).

    The use of the word "products" here implies that digital geospatial data not intended for publication and not subject to release through the Freedom of Information Act (FOIA) need not be formally documented, although it may be advantageous to maintain complete documentation for these data also.

    Creation of formal metadata requires time and effort. In carrying out this work, GD staff should first seek to document those data products that are of most importance to the Programs and the public, and for these, staff should first seek to document the Identification Information and Metadata Reference Information. Those two sections carry the most basic information, answering "who, what, when, where, why, and how" about the data set.

    This minimum level of documentation is not satisfactory in the long term. Full metadata describe details of source data, processing, accuracy, completeness, and consistency that are needed for any substantive use of the products, but much will be gained by first providing abbreviated documentation of our best and most important products.

  2. GD should develop extensions to the standard where it does not cover the unique characteristics of geologic data. Extensions are additional named elements filled in by people who document data sets. These extensions must not be created ad hoc by individuals, because that would destroy the consistency of the metadata; the software that processes metadata may not parse them correctly, and the information could be lost or misplaced. Extensions must be discussed by GD and FGDC personnel familiar with the standard and with the software environment of the Clearinghouse. GD should develop a process for approving extensions and publicizing them within the Division.

  3. GD should develop conventions for the application of CSDGM to the goals of Division programs. Such conventions prescribe characteristics of metadata such as the level of detail expected, judgments about which elements are most important for our use, and methods for encoding commonly-used attributes of geological data.

  4. GD should serve its metadata through one or more nodes of the National Geospatial Data Clearinghouse. The metadata records should be provided free of charge and free of constraints on their access and use.
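The minimum level of documentation named in item 1 above (the Identification Information and Metadata Reference Information sections) lends itself to a simple automated check. The following is a minimal sketch in Python, assuming a record is held as nested dictionaries keyed by element name. The function `missing_minimum_sections` and the draft record are hypothetical, and the check looks only at whether each section is present and non-empty, not at whether its contents are complete.

```python
# Sections the strategy names as the first to document.
MINIMUM_SECTIONS = (
    "Identification_Information",
    "Metadata_Reference_Information",
)


def missing_minimum_sections(record):
    """Return the minimum sections that are absent or empty in `record`."""
    return [name for name in MINIMUM_SECTIONS if not record.get(name)]


# Hypothetical draft record: only the identification section is started.
draft = {
    "Identification_Information": {
        "Description": {"Abstract": "Placeholder abstract."},
    },
}

print(missing_minimum_sections(draft))
# prints ['Metadata_Reference_Information']
```

A check like this could flag records that are not yet ready even for abbreviated release; it is no substitute for the full review recommended elsewhere in this strategy.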

Metadata creation

  1. Initially, the CSDGM itself should be considered sufficient for documenting the Division's geospatial data products.

  2. An electronic mailing list should be set up at USGS to focus discussion of the use of the CSDGM for geologic data. The purpose of the list will be to identify extensions to the CSDGM needed to properly describe digital geologic data, specifying in formal terms both the syntactical structure and the semantic content of the extensions.

  3. Metadata acceptable to the Division must be computer-parseable and conform to encoding guidelines for formal metadata established by FGDC for the National Geospatial Data Clearinghouse.

  4. Methods for creating metadata depend on the software environment used to produce the primary data, the organizational structure of the metadata producer, and the availability and quality of software tools. The Division needs to recognize the most common combinations of these circumstances and find the best methods for each case.

    There are three issues to be resolved: work-flow; tools; and training.

    Work-flow

    Work-flow concerns who does what. GD research has traditionally been carried out by research scientists and their technicians. To create, promote, and support digital products, research groups must now cultivate not only scientific and technical expertise but also specific training and experience in system administration, data management, and customer support.

    We should not assume that data producers (scientific authors and GIS specialists) have to create formal metadata. For many of these people this would not be a wise use of their time, because they typically produce no more than one or a few digital data products per year. Instead, formal metadata should be produced by data managers specifically charged with the task of facilitating the production and use of digital products. Authors and GIS specialists must provide information about the data product, of course, but they need not learn the intricacies of the FGDC standard. Metadata can be produced more efficiently by data managers working together with the creators of the data set.

    While many Division programs have staff devoted to data management, not all are in positions specifically designated for this function. With creation of formal metadata added to standard data management activities such as data base administration, product development, promotion, and customer support, the role of data manager increasingly occupies the full attention of technologically-oriented scientists and scientifically-literate computer specialists. Increasingly, customers of USGS research results are sophisticated computer users; programs and teams that staff adequately for data management will be more effective in providing critical solutions to meet these users' needs.

    Tools

    The chief advantage of using a standard for formal metadata is consistency in both content and display. The CSDGM allows us to describe highly diverse data sets in a way that emphasizes what these data sets have in common. To convey this consistency, the metadata must be written so that corresponding elements in different metadata records are recognizable as such. Our use of the internet to provide the records requires that the records be understood by software, not just by sophisticated people. Consequently, the most usable metadata records will be those that conform to rigorous standards for both structure and format.

    Formal metadata can be produced using a word processor, but there are good reasons to choose more specialized tools that understand the structure of the metadata standard and the file formats expected by the Clearinghouse. Current software developed within GD for creating formal metadata runs on Unix systems. This is economical, because many research databases in the Division are currently created and maintained on Unix systems. While it is in principle possible to port these applications to Microsoft Windows or Macintosh environments, the costs need to be carefully weighed against the benefits.

    BRD has created a tool to work under MS-Windows, but it does not yet conform to file encoding guidelines issued by FGDC.

    ARC/INFO users may work with DOCUMENT, an AML-based extension developed by WRD and EPA. This tool has important shortcomings that limit its usefulness within GD. Some work by experienced GIS specialists might ameliorate the technical shortcomings, but the software presumes a work-flow structure that is not ideal: the GIS specialist must produce the metadata. Many GIS users, especially novices, struggle mightily with ARC/INFO itself, and adding complex formal metadata to their list of duties may not be practical.

    Training

    We will need to provide some formal training and back up that training with robust and responsive user support.

    A general introduction to formal metadata can be conveyed in a few hours and FGDC staff are available to supply this sort of background information. More detailed training specifically intended to develop real facility will take considerably longer. Two-day workshops, one in each major center, should serve this purpose if the attendees have a proper understanding of the need for this activity.

    Proposed outline for two-day training sessions

    1. Attendees must be aware of the general scope of the problem and must come prepared to document a data set from their programs. Attendees must be familiar with basic Unix use.

    2. Introduction to formal metadata and the FGDC standard.

      Half-day presentation and discussion with GD examples.

    3. Introduction to USGS tools (xtme, mp, cns, DOCUMENT)

      Half-day presentation and discussion with GD examples.

    4. Supervised use of tools and users' data

      Half-day work session to create metadata for users' data.

    5. Building and using a Clearinghouse node

      Half-day presentation and hands-on exercises culminating in creation of a working Clearinghouse site serving attendees' metadata records.

  5. Metadata must be reviewed prior to release. Review must include analysis of the syntactical structure of the metadata using the metadata parser (mp) and perusal of the metadata by a scientist familiar with the data.

    To attain high quality and timeliness, the reviewers of metadata should understand it. To develop the needed expertise, representatives of the publications groups should attend at least the first day of the two-day workshops described above. Their work will be better and easier if they attend both days.
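The syntactical part of the review in item 5 amounts to reading a record's structure and reporting the lines that do not fit it. The following is a minimal sketch of that idea in Python. It is not mp and does not reproduce mp's rules: it assumes a simplified indented-text layout in which each line reads `Element_Name: value` and nesting is shown by leading spaces, and it ignores repeated elements and multi-line values. The function `parse_indented` and the sample text are hypothetical.

```python
def parse_indented(text):
    """Parse 'Name: value' lines, nested by indentation, into nested dicts.

    Returns (record, problems). `problems` lists (line_number, message)
    for lines that lack the 'Name:' separator.
    """
    root = {}
    stack = [(-1, root)]  # (indent, container) pairs, innermost last
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        name, sep, value = raw.strip().partition(":")
        if not sep:
            problems.append((number, "expected 'Element_Name: value'"))
            continue
        # Close any compound elements at the same or deeper indentation.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        value = value.strip()
        if value:
            parent[name] = value          # leaf element
        else:
            child = {}
            parent[name] = child          # compound element
            stack.append((indent, child))
    return root, problems


# Hypothetical sample; the last line is deliberately malformed.
sample = """\
Identification_Information:
  Description:
    Abstract: Placeholder abstract.
    Purpose: Placeholder purpose.
Metadata_Reference_Information:
  Metadata_Date: 19970101
  this line has no separator
"""

record, problems = parse_indented(sample)
print(problems)
```

A structural check of this kind catches only mechanical errors. Whether the abstract, accuracy statements, and processing history are correct still requires the scientist's review described above.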

Metadata exchange

  1. Metadata creators must transfer their metadata to an operational node of the National Geospatial Data Clearinghouse or create and maintain a node of their own. Initially the Division has set up a single Clearinghouse node on a system in Reston. As more metadata are produced in the regional centers, we might consider running Clearinghouse nodes from the publications groups, program offices, or regional geologists' offices.

  2. Metadata transferred to a Clearinghouse node must be encoded in SGML conforming to the current document type declaration (DTD) of the FGDC.

    SGML stands for Standard Generalized Markup Language, a way to specify the structure of a textual document. FGDC's use of SGML is relatively simple and is intended to move metadata tools toward open standards for file encoding.

    By special arrangement with a Clearinghouse node, metadata may be transferred using the indented text format acceptable to the USGS metadata parser mp, provided the metadata are not missing any required elements. mp converts proper metadata to SGML, HTML, text, and NASA's Directory Interchange Format (DIF).

  3. Metadata are retrieved from a Clearinghouse node using a web browser or other HTTP agent. Ordinarily metadata are served in HTML but may also be provided in SGML, indented text, or DIF.

    The Division should create an appropriate link to its Clearinghouse node on its public and internal web pages. The Division's node will be registered with FGDC so that it will be accessible from the National Geospatial Data Clearinghouse.
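The conversion from a structured record to tagged text mentioned in item 2 above can be illustrated in a few lines. The following is a minimal sketch in Python and does not reproduce mp's output. It assumes a record held as nested dictionaries, and the function `to_tagged` is hypothetical. The short tag names in `SHORT_NAMES` are believed to match those used by FGDC but are listed here only for illustration; the authoritative names are those in the FGDC DTD. The output is a fragment, not a complete document conforming to that DTD.

```python
from xml.sax.saxutils import escape

# Short tag names for a few elements (illustrative; see the FGDC DTD
# for the authoritative list).
SHORT_NAMES = {
    "Identification_Information": "idinfo",
    "Citation": "citation",
    "Citation_Information": "citeinfo",
    "Originator": "origin",
    "Title": "title",
}


def to_tagged(node, depth=0):
    """Render nested dicts as a list of indented, tagged text lines."""
    lines = []
    pad = "  " * depth
    for name, value in node.items():
        tag = SHORT_NAMES.get(name, name)  # fall back to the long name
        if isinstance(value, dict):
            lines.append(f"{pad}<{tag}>")
            lines.extend(to_tagged(value, depth + 1))
            lines.append(f"{pad}</{tag}>")
        else:
            # Escape &, <, and > so values cannot be mistaken for markup.
            lines.append(f"{pad}<{tag}>{escape(str(value))}</{tag}>")
    return lines


# Hypothetical record fragment.
record = {
    "Identification_Information": {
        "Citation": {
            "Citation_Information": {
                "Originator": "Example Author",
                "Title": "Rocks & minerals (example)",
            }
        }
    }
}

tagged = to_tagged(record)
print("\n".join(tagged))
```

Because every record is rendered by the same rules, corresponding elements in different records carry the same tags, which is what allows Clearinghouse software to compare them.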


This page is <URL:http://ngmdb.usgs.gov/info/standards/metadata/meta_strat.html>
Strategy written by Peter Schweitzer for consideration by the USGS Geologic Division Information Council
Page maintained by Dave Soller
last update: Sept. 16, 1997