Dryad Gold Set


Overview

Indexing of geographic location information in library catalogs, bibliographic databases, and digital libraries has a long history and is closely related to the subfield of geographic information retrieval. Geographic indexing generally focuses on three areas: (a) the geographic topicality of an information resource, as identified by concepts in the text; (b) the geographic applicability of an information resource; and (c) the location of the resource itself (e.g., place of publication or distribution).

Williams (2009) describes several problems encountered during manual geographic indexing. Place names, which are frequently used for geographic indexing, are difficult to define, frequently change over time, and sometimes cease to exist altogether. Places are referred to by many names and often have names in multiple languages. Many place names are not unique, a condition referred to as toponymic homonymy (e.g., Springfield). Many place names are also non-geographic homonyms. These problems are not limited to manual indexing and are central challenges to automatic geographic indexing.

The field of geographic information retrieval (GIR) emerged in the 1990s as researchers began to explore new ways of searching and visualizing information held in online catalogs and digital libraries based on spatial attributes (Woodruff & Plaunt, 1994; Buckland et al. 2004). Geographic location information typically comes in three forms: (a) maps, (b) numerical coordinates, and (c) natural language text (Buchel & Hill, 2009). Recent GIR research has focused primarily on techniques for extracting geographic locations that appear in unstructured text as place names or coordinates, and for resolving ambiguous references.

The study proposed in this paper is part of a three-part series exploring techniques for the automatic indexing of topical, geographical, and taxonomic information in Dryad, a scientific data repository. The purpose of this study is to review and evaluate techniques for the automatic indexing of geographic locations in data sets and associated article metadata deposited in the Dryad repository.

Geographic Information Retrieval

Research in geographic information retrieval focuses primarily on the identification of geographic location information in unstructured text and the resolution of toponymic and non-toponymic homonyms. The process of identifying location information in text is referred to as text-based georeferencing (Woodruff & Plaunt, 1994), geoparsing (Leidner, 2007), or geotagging (Amitay et al. 2004). For the purpose of this article, these processes will all be referred to as georeferencing. The process of mapping natural language to spatial coordinate systems is referred to as geocoding.

Like other Named Entity Recognition (NER) tasks, georeferencing consists of two phases: name identification and name disambiguation. There are three primary types of georeferencing techniques: 1) knowledge-based (i.e., relying on external sources of information such as gazetteers), 2) rule-based, and 3) machine-learning-based.

Woodruff and Plaunt (1994) describe GIPSY, a system for geographic indexing of text documents using spatial coordinates, which uses a knowledge-based approach. In their system, words and phrases containing place names in unstructured text are mapped to coordinates using an intermediate thesaurus based on the US Geological Survey’s Geographic Names Information System (GNIS) and the Geographic Information Retrieval and Analysis System (GIRAS) land-use data set. Document text is matched against the thesaurus and geographic names are translated into coordinate information. Spatial reasoning techniques are used to approximate the location referenced in the text.
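The matching step of a knowledge-based approach can be illustrated with a minimal sketch. It is not GIPSY itself: the toy gazetteer, its approximate coordinates, and the function name match_place_names are assumptions made for illustration, and the spatial reasoning step is omitted.

```python
import re

# Hypothetical toy gazetteer: lowercased place name -> candidate (lat, lon) pairs.
# Coordinates are rough approximations, for illustration only.
GAZETTEER = {
    "durham": [(35.99, -78.90), (54.78, -1.58)],  # ambiguous name, two candidates
    "north carolina": [(35.5, -80.0)],
    "lake victoria": [(-1.0, 33.0)],
}

def match_place_names(text, gazetteer, max_words=3):
    """Return (name, candidates) pairs for gazetteer names found in text.

    At each token position the longest matching phrase (up to max_words
    tokens) is preferred, so "north carolina" wins over a shorter match.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                matches.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            # No gazetteer entry starts at this token; move on.
            i += 1
    return matches

found = match_place_names("Samples were collected near Durham, North Carolina.", GAZETTEER)
```

Here "durham" is returned with two candidate coordinates, which is exactly the ambiguity that the disambiguation phase has to resolve.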

Amitay et al. (2004) describe an approach to identifying the central geographic focus of web pages also using a knowledge-based approach. The authors develop a gazetteer by combining multiple public sources of information (GNIS, UNSD, ISO-3166-1). The text is then parsed and all terms matching names in the combined gazetteer are extracted. A disambiguation algorithm is used to resolve toponyms (the algorithm itself is not detailed in the article). Each place name in the document is scored based on frequency and the frequency information is used to determine overall page focus. Results were compared to a test collection ranked by human editors using the Open Directory Project (ODP).
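The idea of scoring place names by frequency can be sketched as follows. The sketch assumes place mentions have already been extracted and disambiguated; the function name rank_by_frequency is hypothetical, and a plain relative count is not claimed to reproduce the scoring used by Amitay et al.

```python
from collections import Counter

def rank_by_frequency(place_mentions):
    """Rank resolved place names by their share of all mentions in a document."""
    counts = Counter(place_mentions)
    total = sum(counts.values())
    # most_common() orders by descending count; the first entry is the
    # candidate for the document's overall geographic focus.
    return [(place, count / total) for place, count in counts.most_common()]

ranking = rank_by_frequency(["Kenya", "Uganda", "Kenya", "Kenya"])
```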

Martins and Silva (2005) developed a PageRank-inspired graph-based ranking algorithm for georeferencing using an ontology. They developed an ontology of geographic concepts by merging several public sources. The text of documents is parsed and matched to concepts in the ontology. The list of candidate geographic concepts is then ranked using the PageRank algorithm, leveraging the relationships between geographic places in the ontology to resolve ambiguities. Since natural language texts often include multiple clues about important geographic places (e.g., mentioning both a city and its state), terms used in the text that are also related in the ontology score higher.
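The intuition behind graph-based ranking can be shown with a small sketch of plain PageRank over candidate concepts. This is not Martins and Silva's exact formulation: the candidate graph, the damping factor, and the iteration count are assumptions chosen for illustration. The example document is assumed to mention "Springfield", "Illinois", and "Chicago", so both Springfields are candidates but only one is linked to other mentioned places.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each node to a list of nodes it links to; every linked
    node must also appear as a key.
    """
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    new_rank[nb] += share
            else:
                # A node with no links spreads its rank evenly over all nodes.
                share = damping * rank[node] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
        rank = new_rank
    return rank

# Hypothetical candidate graph: edges are part-of relations between concepts
# matched in the document, stored in both directions.
candidates = {
    "Illinois": ["Springfield, IL", "Chicago"],
    "Springfield, IL": ["Illinois"],
    "Chicago": ["Illinois"],
    "Springfield, MA": [],  # Massachusetts is not mentioned, so no edges
}
scores = pagerank(candidates)
```

Because "Springfield, IL" receives rank from the co-mentioned "Illinois" while "Springfield, MA" is isolated, the Illinois reading scores higher.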

Buscaldi and Rosso (2008) describe an approach to place name disambiguation using WordNet. They extracted place names and relationships from WordNet and applied a conceptual density approach to disambiguate place names. Conceptual density (CD) is a measure of the correlation between a sense of a word and its context using term subhierarchies. The authors’ approach outperformed a standard dictionary-based approach on measures of precision and recall.

Overell and Rüger (2008) used Wikipedia to generate a co-occurrence model that is then used for place name disambiguation. The authors created a “gold set” or “groundtruth” for evaluation, consisting of 1,000 Wikipedia articles mapped to TGN records. The authors then applied a hierarchical disambiguation method that combines statistical co-occurrence information with a rule-based approach to resolve ambiguous place names. The rule-based approach applies structural information from within Wikipedia articles, such as template data (lat/long), categories, or references/links between articles.

Of the various approaches described above, the graph-based ranking algorithm proposed by Martins and Silva (2005) appears to be the best-suited approach to geographic indexing in the Dryad repository using a controlled vocabulary like TGN.

Authoritative Sources of Geographic Information

Indexing of geographic locations, sometimes referred to as place names, has a long history in traditional bibliographic indexing. The Library of Congress Subject Headings (LCSH) contains over 60,000 geographic headings. The Medical Subject Headings (MeSH), the primary vocabulary used for indexing in MEDLINE and PubMed, contains 500 geographic descriptors. AGROVOC, the primary vocabulary for indexing FAO resources, contains 1,700 geographic descriptors.

The Getty Thesaurus of Geographic Names (TGN) is a controlled vocabulary compiled by the Getty Research Institute intended for the description and organization of information about art, architecture, and material culture. The TGN is hierarchically structured and contains over 1.1 million records and associated information about over 900,000 places. The TGN includes over 1,500 different place and feature types, from planets, oceans, continents, and nations to towns, villages, basins, and creeks. For each location, the TGN includes the preferred name, current and historic variant names, geographic coordinates, place types, and relationships to other places in the vocabulary.

Since the 1970s, bibliographic databases have provided access to geographic concepts through controlled vocabularies and thesauri. The GeoRef database, established in 1966, was the first to include spatial coordinates for some records. Most databases provide geographic information access through the use of place names in controlled vocabularies. The next section reviews instructions given to professional indexers for identifying and indexing geographic information.

Research questions

This study will explore the following questions:

How can important geographic aspects of scientific articles and data be automatically determined?
What is the best algorithm for automatic indexing of geographic information in Dryad with the Getty Thesaurus of Geographic Names (TGN)?
How often do author-assigned or indexer-assigned geographic terms appear in the article metadata or deposited data?
Are there any distinctions between depositor and information professional indexing of geographic names?

Methodology

This section describes the methodology proposed for the evaluation of algorithms for geographic indexing in the Dryad repository using the Getty Thesaurus of Geographic Names (TGN). For this study, two corpora will be used: the Dryad “gold set” and a modified version of Overell and Rüger’s Wikipedia “ground truth.” Both collections are described in detail below. The Maui machine-learning algorithm and a graph-based ranking algorithm, described further below, will be evaluated using both corpora and standard measures of precision@20 and mean average precision.

Dryad Corpus — Geographic

The geographic indexing portion of the Dryad corpus consists of the following:

Dryad article metadata for 200 records in METS format, including depositor-supplied geographic keywords.
BIOSIS Previews article metadata for the same 200 records, in Thomson ISI export format, including BIOSIS indexing for geographic data.
PubMed article metadata for the same 200 records, in PubMed XML format, including MeSH geographic headings.
A mapping of the geographic keywords in each Dryad record to TGN IDs.
A mapping of the geographic data in each BIOSIS Previews record to TGN IDs.
A mapping of the geographic headings in each PubMed record to TGN IDs.

Wikipedia “Groundtruth”

The original “groundtruth” created by Overell and Rüger is mapped to an earlier version of TGN that is no longer available. For the purpose of this evaluation, their groundtruth mapping files will be downloaded and any invalid entries either mapped to the latest (2011) version of TGN or removed from the test set. The original groundtruth also does not include the Wikipedia articles used for testing. A recent Wikipedia XML corpus will be downloaded and the associated articles extracted.

Algorithms

As part of this study, two algorithms will be evaluated using the Dryad gold set and Wikipedia groundtruth.

Maui: Maui, a machine-learning-based algorithm (Medelyan, 2009), will be evaluated using 10-fold cross-validation. The Dryad and Wikipedia corpora will be randomly divided into 10 pairs of training and test sets. The training sets will be used to build the Maui statistical models and the test sets will be used for performance evaluation. Maui parameters will be tuned and the best-performing configuration used for the final comparison.
Graph-based ranking approach: A graph-based approach similar to the one described by Martins and Silva (2005) will also be evaluated using 10-fold cross-validation. Since this is an unsupervised method, no training data is required. Parameters will be tuned and the best-performing configuration used for the final comparison.
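The 10-fold partitioning can be sketched as follows. This is a minimal illustration, not part of Maui: the function name k_fold_splits, the fixed seed, and the use of integer record IDs are assumptions. With 200 records, each fold holds 20 test records and 180 training records.

```python
import random

def k_fold_splits(record_ids, k=10, seed=0):
    """Yield (train, test) lists so that every record is tested exactly once."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)          # reproducible random order
    folds = [ids[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical record IDs standing in for the 200 Dryad records.
splits = list(k_fold_splits(range(200)))
```

For the unsupervised graph-based method the training portion of each split would simply go unused, which keeps the test sets identical across the two algorithms.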

Metrics

Standard measures of precision@20, mean average precision, and precision-recall curves will be used to evaluate the performance of the two systems using the two provided test collections.
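The ranking measures named above follow their standard definitions and can be sketched in a few lines. The function names are illustrative, and the sketch assumes each ranked list contains no duplicate terms.

```python
def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top k ranked terms that are relevant."""
    return sum(1 for term in ranked[:k] if term in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant term occurs."""
    hits = 0
    total = 0.0
    for position, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per document."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Toy example with hypothetical term labels: relevant terms at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

In the toy example the precision is 1/1 at rank 1 and 2/3 at rank 3, giving an average precision of 5/6.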

References

Amitay, E., Har’El, N., Sivan, R., and Soffer, A. (2004). “Web-a-Where: Geotagging Web Content.” SIGIR’04, July 25-29, 2004.

Bliss, H. E. (1939). The Organization of Knowledge in Libraries and the Subject-Approach to Books. New York: H.W. Wilson.

BIOSIS. (1987). BIOSIS Previews Search Guide. Philadelphia: BIOSIS.

BIOSIS. (2011). BIOSIS Previews Help. http://images.webofknowledge.com/WOK45/help/BIOSIS/h_fullrec.html

Buchel, O, and Hill, L. L. (2009). “Treatment of Georeferencing in Knowledge Organization Systems: North American Contributions to Integrated Georeferencing.” North American Symposium on Knowledge Organization, June 18-19, 2009, Syracuse, New York.

Buckland, M., Chen, A., Gey, F. C., Larson, R. R., Mostern, R., and Petras, V. (2007). “Geographic Search: Catalogs, Gazetteers, and Maps.” College & Research Libraries 68(5): 376-387.

Buckland, M. and Lancaster, L. (2004). “Combining place, time, and topic: The Electronic Cultural Atlas Initiative.” D-Lib Magazine 10(5).

Buscaldi, D., and Rosso, P. (2008). “A conceptual density-based approach for the disambiguation of toponyms.” International Journal of Geographical Information Science, 22(3): 301-313.

Hill, L. (1989). “Geographic Indexing for Bibliographic Databases.” Resource Sharing & Information Networks 4(2): 1-12.

Hodge, G. (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. Washington, D.C.: The Digital Library Federation.

Hood, M.W. (1990). AGRICOLA – Guide to Subject Indexing.

J. Paul Getty Trust. (2000). Getty Thesaurus of Geographic Names: User’s Guide to the TGN Data Releases.

Leidner, J. L. and Lieberman, M.D. (2011). “Detecting Geographical References in the Form of Place Names and Associated Spatial Natural Language.” SIGSPATIAL Special, 3(2): 5-11.

Leveling, J., and Hartrumpf, S. (2008). “On metonymy recognition for geographic information retrieval.” International Journal of Geographical Information Science 22(3): 289-299.

Martins, B., Anastacio, I., and Calado, P. (2010). “A Machine Learning Approach for Resolving Place References in Text.” In M. Painho et al. (eds.), Geospatial Thinking, Lecture Notes in Geoinformation and Cartography.

Martins, B. and Silva, M. (2005). “A Graph-Ranking Algorithm for Geo-Referencing Documents.” Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05).

Medelyan, O. (2009). Human-competitive automatic topic indexing. The University of Waikato.

National Library of Medicine. (2005). MEDLINE Indexing: Online Training Course.

Overell, S. E., and Rüger, S. (2006). “Identifying and grounding descriptions of places.” Information Retrieval, 2-4.

Overell, S., and Rüger, S. (2008). “Using co-occurrence models for placename disambiguation.” International Journal of Geographical Information Science, 22(3): 265-287.

Ranganathan, S.R. (1937). Prolegomena to Library Classification. Madras: The Madras Library Association.

Williams, P. (2008). “The problem with place names: the moulds may change, but the jelly remains the same.” Catalogue & Index 157.

Woodruff, A.G. and Plaunt, C. (1994). “GIPSY: Automated Geographic Indexing of Text Documents.” Journal of the American Society for Information Science, 45(9).