Topical Indexing

Background

Maui is the state of the art in automatic topical indexing using controlled vocabularies. The algorithm was tested in Medelyan’s dissertation using MeSH and a collection of 500 full-text documents from PubMed. Maui uses a supervised learning approach, requiring training data.

Impl                                P@10   R@10   F@10
Maui, NB (all features)             41.8   32.0   36.3
Maui, DT (non-Wikipedia features)   52.0   39.1   44.6
Maui, DT (all features)             55.4   41.7   47.6

Question

What is Maui’s best performance using MeSH on the Dryad test collection? Compare Maui’s suggested MeSH terms to Dryad author-supplied keywords (mapped to MeSH), BIOSIS Previews concept codes (mapped to MeSH), and PubMed indexing.

Test Collections

Collection        Records   Terms
Dryad             83        462 terms; 344 unique keywords mapped to 155 unique MeSH headings
BIOSIS Previews   108       1285 terms; 91 unique concept codes mapped to 94 unique MeSH headings
PubMed            189       1513 terms; 336 unique MeSH headings

Mapping summary

The following table summarizes the mapping process from each test collection to the MeSH vocabulary. Since PubMed already uses MeSH, no mapping was necessary, resulting in a 100% match. Mapping BIOSIS concept codes to MeSH terms was challenging: many concept codes mapped to multiple MeSH headings (1:many), which is captured in the “Other” category. Only 23% of Dryad depositor-supplied terms mapped directly to MeSH terms (preferred or alternate labels); 58% mapped to broader terms, suggesting that depositors select terms that are narrower than the concepts represented in MeSH.

Match type           BIOSIS   PubMed   Dryad
Matched preferred    33%      100%     12%
Matched alternate    19%      0%       11%
Alternate spelling   0%       0%       0%
Matched broader      0%       0%       58%
Matched narrower     0%       0%       6%
Other                46%      0%       3%
No match             1%       0%       8%

Method

  • Partition test collection into 10 random training/test sets
  • Build Maui models using training sets
  • Run Maui topic extraction (n=20) using test sets
  • Average precision@n, recall@n, and f1@n across the random sets
  • Tune parameters to maximize precision@n, recall@n, and f1@n
  • Inputs: title, abstract, keywords, and data (time permitting)
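
The partitioning step above can be sketched as a small shell script. This is a minimal illustration, not the scripts actually used: the directory layout (a flat docs/ folder, partitions/part$i/{train,test}/) and the toy document files are assumptions for the demo.

```shell
#!/bin/sh
# Split a flat directory of documents into 10 random train/test partitions.
# Hypothetical layout: docs/ holds the collection; each split i gets
# partitions/part$i/test (fold i) and partitions/part$i/train (the rest).
set -e

# Create a toy 20-document collection for demonstration if docs/ is empty.
mkdir -p docs
[ -n "$(ls docs)" ] || for n in $(seq 1 20); do echo "document $n" > "docs/d$n.txt"; done

# Shuffle the document list once, then deal files into folds 0-9 round-robin.
rm -f fold*.list shuffled.list
ls docs | shuf > shuffled.list
i=0
while read -r f; do
  echo "$f" >> "fold$((i % 10)).list"
  i=$((i + 1))
done < shuffled.list

# For split i, fold i is the test set (10%) and the other nine folds train (90%).
for part in 0 1 2 3 4 5 6 7 8 9; do
  mkdir -p "partitions/part$part/train" "partitions/part$part/test"
  for fold in 0 1 2 3 4 5 6 7 8 9; do
    if [ "$fold" -eq "$part" ]; then dest=test; else dest=train; fi
    while read -r f; do
      cp "docs/$f" "partitions/part$part/$dest/"
    done < "fold$fold.list"
  done
done
```

With 20 documents, each of the 10 folds holds two files, so every split has 2 test and 18 training documents.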

Build models and run topic extraction

  • Partition test collections into 10 random training (90%) / test (10%) splits
  • For each collection (biosis, pubmed, dryad)
  • For each input type (title, title+abs, title+abs+kw)
  • For each partition (0-9)
  • For each stemmer (None, Lovins, Sremoval, Porter)
  • For each minimum-occurrence threshold, minoccur (1, 2)
  • Create a new model
java -$heapsize maui.main.MauiModelBuilder \
  -l partitions/$collection/$format/part$part/train/ \
  -m partitions/$collection/$input_type/part$part/$model \
  -v mesh_20110305 -f skos -i en -e utf-8 -x 5 -y 1 -o $minoccur -t $stemmer -d \
  >& logs/$collection-$format-$model.log
  • Run topic extraction (n = 20) with IDF feature disabled
java -$heapsize maui.main.MauiTopicExtractor \
  -l partitions/$collection/$format/part$part/test/ \
  -m partitions/$collection/$input_type/part$part/$model \
  -v mesh_20110305 -f skos -i en -e utf-8 -n 20 -t $stemmer -d \
  >> results/$collection-$format-$part-$model.out
  • Run topic extraction (n = 20) with IDF feature enabled
java -$heapsize maui.main.MauiTopicExtractor \
  -l partitions/$collection/$format/part$part/test/ \
  -m partitions/$collection/$input_type/part$part/$model \
  -v mesh_20110305 -f skos -i en -e utf-8 -n 20 -t $stemmer -b -d \
  >> results/$collection-$format-$part-$model-tfidf.out

Maui reports precision@n, recall@n, and F@n. Results are averaged across the 10 test partitions.
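
The averaging step can be done with a short awk pass. This sketch assumes the per-partition scores have already been pulled out of Maui's output into a whitespace-delimited file (partition, P@10, R@10, F@10); the file name, layout, and the two toy score rows are hypothetical.

```shell
#!/bin/sh
# Average P@10, R@10, and F@10 across partition result rows.
# scores.tsv is a hypothetical intermediate file: one row per partition,
# here filled with toy values for two partitions.
cat > scores.tsv <<'EOF'
0 20.0 40.0 26.7
1 30.0 50.0 37.5
EOF

# Sum each metric column and divide by the number of partitions.
awk '{ p += $2; r += $3; f += $4; n++ }
     END { printf "P@10=%.2f R@10=%.2f F@10=%.2f\n", p/n, r/n, f/n }' \
  scores.tsv > averages.txt
cat averages.txt
```

For the toy rows above this prints P@10=25.00 R@10=45.00 F@10=32.10.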

Results

This section presents the results of Maui indexing (default Naive Bayes classifier) using the MeSH vocabulary on each test collection at k=10. Maui achieves a maximum recall of 43.81% of Dryad depositor-supplied subject keywords as mapped to MeSH.

Dryad

Measure   Value   Configuration
P@10      24.84   title, Sremoval stemmer, min occur=1
R@10      43.81   title+abs+kw, no stemmer, min occur=1
F@10      29.10   title+abs+kw, Sremoval stemmer, min occur=1
  • Maui achieves maximum 43.81% recall for Dryad depositor-supplied keywords as manually mapped to MeSH terms with k=10.

BIOSIS

Measure   Value   Configuration
P@10      13.51   title, Lovins stemmer, min occur=1
R@10      11.08   title+abs+kw, Porter stemmer, min occur=1
F@10      11.60   title+abs+kw, Porter stemmer, min occur=1
  • Maui achieves only 11.08% recall for BIOSIS concept codes as manually mapped to MeSH terms with k=10.

PubMed

Measure   Value   Configuration
P@10      26.47   title+abs+kw, Lovins stemmer, min occur=1
R@10      32.26   title+abs+kw, Lovins stemmer, min occur=1
F@10      28.95   title+abs+kw, Lovins stemmer, min occur=1
  • Maui achieves 32.26% recall for PubMed MeSH headings with k=10.

For full results, see File:Maui Results.pdf