Background
Maui is the state of the art in automatic topical indexing with controlled vocabularies. Medelyan’s dissertation evaluated the algorithm using MeSH on a collection of 500 full-text documents from PubMed. Maui takes a supervised learning approach and therefore requires training data. The dissertation reports the following precision, recall, and F-measure at 10 (in percent) for the Naive Bayes (NB) and decision tree (DT) classifiers, with and without the Wikipedia-derived features:
Implementation | P@10 | R@10 | F@10 |
---|---|---|---|
Maui, NB (all features) | 41.8 | 32.0 | 36.3 |
Maui, DT (non-Wikipedia features) | 52.0 | 39.1 | 44.6 |
Maui, DT (all features) | 55.4 | 41.7 | 47.6 |
Question
What is Maui’s best performance using MeSH on the Dryad test collection? Compare Maui’s suggested MeSH terms to Dryad author-supplied keywords (mapped to MeSH), BIOSIS Previews concept codes (mapped to MeSH), and PubMed indexing.
Test Collections
Collection | Records | Terms |
---|---|---|
Dryad | 83 records | 462 terms; 344 unique keywords mapped to 155 unique MeSH headings |
BIOSIS Previews | 108 records | 1285 terms; 91 unique concept codes mapped to 94 unique MeSH headings |
PubMed | 189 records | 1513 terms; 336 unique MeSH headings |
Mapping summary
The following table summarizes the mapping from each test collection’s terms to the MeSH vocabulary. Since PubMed already uses MeSH, no mapping was necessary, giving a 100% match rate. Mapping BIOSIS concept codes to MeSH was more challenging: many concept codes mapped to multiple MeSH headings (1:many), which is captured in the “Other” category. Only 23% of Dryad depositor-supplied terms mapped directly to MeSH terms (preferred or alternate labels), while 58% mapped to broader terms, suggesting that depositors select terms narrower than the concepts represented in MeSH.
Match type | BIOSIS | PubMed | Dryad |
---|---|---|---|
Matched preferred | 33% | 100% | 12% |
Matched alternate | 19% | 0% | 11% |
Alternate spelling | 0% | 0% | 0% |
Matched broader | 0% | 0% | 58% |
Matched narrower | 0% | 0% | 6% |
Other | 46% | 0% | 3% |
No match | 1% | 0% | 8% |
Method
- Partition each test collection into 10 random training/test splits
- Build Maui models using the training splits
- Run Maui topic extraction (n=20) on the test splits
- Average precision@n, recall@n, and F@n (defined below) across the 10 splits
- Tune parameters (input type, stemmer, minimum occurrence) to maximize precision@n, recall@n, and F@n
- Inputs: title, abstract, keywords, and data (time permitting)
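Here precision, recall, and F-measure at n are the standard definitions, computed per document over the top n suggested terms and then averaged over documents:

$$
P@n = \frac{|\mathrm{suggested}_n \cap \mathrm{gold}|}{n}, \qquad
R@n = \frac{|\mathrm{suggested}_n \cap \mathrm{gold}|}{|\mathrm{gold}|}, \qquad
F@n = \frac{2 \cdot P@n \cdot R@n}{P@n + R@n}
$$

where suggested_n is the set of Maui’s top n suggested terms for a document and gold is the document’s manually assigned (or mapped) MeSH terms.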
Build models and run topic extraction
- Partition each test collection into 10 random training (90%) / test (10%) splits
- For each collection (biosis, pubmed, dryad)
- For each input type (title, title+abs, title+abs+kw)
- For each partition (0-9)
- For each stemmer (None, Lovins, Sremoval, Porter)
- For each minimum occurrence threshold, minoccur (1, 2)
- Create a new model (see the assembled driver sketch after this list)
java -$heapsize maui.main.MauiModelBuilder -l partitions/$collection/$input_type/part$part/train/ -m partitions/$collection/$input_type/part$part/$model -v mesh_20110305 -f skos -i en -e utf-8 -x 5 -y 1 -o $minoccur -t $stemmer -d >& logs/$collection-$input_type-$part-$model.log
- Run topic extraction (n = 20) with IDF feature disabled
java -$heapsize maui.main.MauiTopicExtractor -l partitions/$collection/$input_type/part$part/test/ -m partitions/$collection/$input_type/part$part/$model -v mesh_20110305 -f skos -i en -e utf-8 -n 20 -t $stemmer -d >> results/$collection-$input_type-$part-$model.out
- Run topic extraction (n = 20) with IDF feature enabled
java -$heapsize maui.main.MauiTopicExtractor -l partitions/$collection/$input_type/part$part/test/ -m partitions/$collection/$input_type/part$part/$model -v mesh_20110305 -f skos -i en -e utf-8 -n 20 -t $stemmer -b -d >> results/$collection-$input_type-$part-$model-tfidf.out
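Assembled into a single driver, the loops above might look like the following bash sketch. The stemmer identifiers, heap size, and model naming are assumptions to be adapted to the local Maui installation; the directory layout follows the commands above.

```bash
#!/bin/bash
# Driver sketch for the experiment grid described above.
# Assumptions: Maui and its dependencies are on the classpath, the
# mesh_20110305 SKOS vocabulary is installed, and the stemmer names
# below match the identifiers accepted by this Maui version's -t flag.
heapsize="Xmx2g"   # example heap setting
mkdir -p logs results

for collection in biosis pubmed dryad; do
  for input_type in title title_abs title_abs_kw; do
    for part in 0 1 2 3 4 5 6 7 8 9; do
      for stemmer in NoStemmer LovinsStemmer SremovalStemmer PorterStemmer; do
        for minoccur in 1 2; do
          model="model_${stemmer}_${minoccur}"
          dir="partitions/$collection/$input_type/part$part"

          # Build a model from the training split
          java -$heapsize maui.main.MauiModelBuilder \
            -l $dir/train/ -m $dir/$model \
            -v mesh_20110305 -f skos -i en -e utf-8 \
            -x 5 -y 1 -o $minoccur -t $stemmer -d \
            >& logs/$collection-$input_type-$part-$model.log

          # Extract the top 20 topics from the test split
          # (repeat with -b, writing to ...-tfidf.out, for the IDF-enabled run)
          java -$heapsize maui.main.MauiTopicExtractor \
            -l $dir/test/ -m $dir/$model \
            -v mesh_20110305 -f skos -i en -e utf-8 \
            -n 20 -t $stemmer -d \
            >> results/$collection-$input_type-$part-$model.out
        done
      done
    done
  done
done
```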
Maui reports precision@n, recall@n, and F@n. Results are averaged across the 10 test partitions.
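As a sketch of the averaging step, assuming (this is an assumption, not Maui’s documented output format) that each result file contains a line of the form `Precision: 0.248`:

```bash
# Hypothetical aggregation: average precision across the ten partition runs.
# The "Precision:" pattern is assumed; match it to the actual
# MauiTopicExtractor output before use.
for collection in biosis pubmed dryad; do
  grep -h "Precision:" results/$collection-*.out |
    awk -v c="$collection" '{ sum += $2; n++ }
      END { if (n) printf "%s mean P@10 = %.2f%%\n", c, 100 * sum / n }'
done
```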
Results
This section presents the results of Maui indexing (default Naive Bayes classifier) using the MeSH vocabulary with each test collection at k=10. Maui achieves a maximum recall of 43.81% against Dryad depositor-supplied subject keywords as mapped to MeSH.
Dryad
Measure | Value | Configuration |
---|---|---|
P@10 | 24.84 | title, Sremoval stemmer, min occur=1 |
R@10 | 43.81 | title+abs+kw, No stemmer, min occur=1 |
F@10 | 29.10 | title+abs+kw, Sremoval stemmer, min occur=1 |
- Maui achieves a maximum of 43.81% recall for Dryad depositor-supplied keywords as manually mapped to MeSH terms with k=10.
BIOSIS
Measure | Value | Configuration |
---|---|---|
P@10 | 13.51 | title, Lovins stemmer, min occur=1 |
R@10 | 11.08 | title+abs+kw, Porter stemmer, min occur=1 |
F@10 | 11.60 | title+abs+kw, Porter stemmer, min occur=1 |
- Maui achieves only 11.08% recall for BIOSIS concept codes as manually mapped to MeSH terms with k=10.
PubMed
Measure | Value | Configuration |
---|---|---|
P@10 | 26.47 | title+abs+kw, Lovins stemmer, min occur=1 |
R@10 | 32.26 | title+abs+kw, Lovins stemmer, min occur=1 |
F@10 | 28.95 | title+abs+kw, Lovins stemmer, min occur=1 |
- Maui achieves 32.26% recall for PubMed MeSH headings with k=10.
For full results, see File:Maui Results.pdf