Taxonomic Indexing



Background

Maui is not intended to support taxonomic indexing: its matching algorithm is not well suited to identifying taxonomic names in text, and its filtering algorithm does not support disambiguation. NER techniques such as Neti Neti have proven effective at identifying taxonomic names in text, but they do not specifically support matching to a controlled vocabulary such as ITIS.

Questions

  • How can we identify and index taxonomic names using a controlled vocabulary (ITIS)?
  • How does a simple thesaurus-based matching process compare to NER and Maui?

Test Collections

Data: title, abstract, and keywords (and the underlying data files, time permitting)

Collection         Records   Terms
Dryad              117       240 terms; 220 unique author-supplied names mapped to 197 unique ITIS TSNs
BIOSIS Previews    150       247 terms; 126 BIOSIS classifiers mapped to 123 unique ITIS TSNs
PubMed             168       230 terms; 129 unique MeSH descriptors mapped to 110 unique ITIS TSNs

Mapping summary

The test collection consists of Dryad depositor-supplied keywords along with indexing from BIOSIS Previews and PubMed mapped to taxonomic names in ITIS.

The results of the mapping process are summarized below. For BIOSIS and PubMed, over 90% of the index terms were found in ITIS, matching either the preferred term or an alternate term. Only 2% (BIOSIS) and 3% (PubMed) of index terms were mapped to broader terms (ranks) in ITIS.

In the case of Dryad depositor-supplied terms, 80% were found in ITIS and 16% were mapped to broader ITIS terms. This indicates that depositors often select terms narrower than those represented in ITIS.

Match type           BIOSIS   PubMed   Dryad
Matched preferred    86%      58%      77%
Matched alternate    9%       32%      0%
Alternate spelling   1%       2%       3%
Matched broader      2%       3%      16%
Matched narrower     0%       0%       2%
No match             2%       7%       2%

Distribution of terms by taxonomic rank

The following chart presents the distribution of assigned ITIS terms by taxonomic rank for each collection. The ITIS terms were assigned by mapping author- and indexer-assigned terms to ITIS; the results of the mapping process are presented above.

Terms assigned from the BIOSIS and PubMed vocabularies are primarily from the Class, Order, and Family taxonomic ranks. Dryad depositors assign more terms from the Species and Genus ranks. The automatic indexing algorithm tested here most frequently assigns terms from the Species rank.

[Figure: distribution of assigned ITIS terms by taxonomic rank (Itis term by rank.jpg)]

This suggests that, to maximize precision and recall as compared to BIOSIS and PubMed indexing, Class, Order, and Family ranks should be accounted for in addition to Species and Genus.

Algorithm

Matching

  • Normalize text: remove punctuation, remove stopwords (optional), and stem (optional)
  • Generate all n-grams of length min-ngram (1) to max-ngram (4)
  • Order n-grams by descending length (consider the longest n-grams first)
  • Only consider n-grams where the first letter of the first word is capitalized
  • Look up each n-gram in ITIS (preferred term and alternate term or common name)
  • For matched n-grams, store the following attributes (a sketch follows this list):
    • Frequency (int: how many times has this ITIS term been matched?)
    • Ambiguity (int: how many entries in ITIS match this n-gram?)
    • Match length (int: how long is the matched n-gram?)
    • Match preferred (boolean: is this an exact match on the preferred label?)
    • Match case (boolean: does the n-gram match the ITIS term case?)
    • Term level (int: what is the level/depth of this term in ITIS?)
    • N-gram (string: the exact string that was matched in the text)
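
The matching step can be illustrated with a short Python sketch. This is a minimal sketch, not the HIVE implementation: candidate_ngrams and match_candidates are hypothetical names, itis_lookup stands in for whatever ITIS access layer is available, and the entry fields (tsn, preferred, label, level) are assumptions about that layer.

    import re

    def candidate_ngrams(text, min_ngram=1, max_ngram=4):
        """Yield n-grams whose first word is capitalized, longest first."""
        tokens = re.sub(r"[^\w\s]", " ", text).split()  # strip punctuation
        for n in range(max_ngram, min_ngram - 1, -1):   # longest n-grams first
            for i in range(len(tokens) - n + 1):
                words = tokens[i:i + n]
                if words[0][0].isupper():               # capitalization filter
                    yield " ".join(words)

    def match_candidates(text, itis_lookup):
        """Look up each n-gram in ITIS and record the attributes listed above.

        itis_lookup maps a case-folded string to a list of matching ITIS
        entries, each a dict with 'tsn', 'preferred' (bool), 'label', 'level'.
        """
        candidates = []
        for ngram in candidate_ngrams(text):
            entries = itis_lookup(ngram.lower())
            for entry in entries:
                candidates.append({
                    "tsn": entry["tsn"],
                    "frequency": 1,                         # aggregate per TSN later
                    "ambiguity": len(entries),              # ITIS entries for this n-gram
                    "match_length": len(ngram.split()),     # words in the matched n-gram
                    "match_preferred": entry["preferred"],  # exact match on preferred label?
                    "match_case": entry["label"] == ngram,  # does the case match too?
                    "term_level": entry["level"],           # depth of the term in ITIS
                    "ngram": ngram,                         # exact string matched in the text
                })
        return candidates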

Calculate connectedness/degree

  • Determine relationships between all candidate terms output from the matching process.
  • Store the maximum connected path length between each term and any other term (a sketch follows this list).
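
A minimal sketch of the connectedness computation, assuming a parents dict that maps each TSN to its parent TSN in the ITIS hierarchy (an assumption; the real store may expose broader/narrower relations differently):

    def path_length(tsn_a, tsn_b, parents):
        """Path length between two terms through the hierarchy, or None."""
        def ancestors(tsn):
            chain = [tsn]                    # term, parent, grandparent, ...
            while tsn in parents:
                tsn = parents[tsn]
                chain.append(tsn)
            return chain

        a_chain, b_chain = ancestors(tsn_a), ancestors(tsn_b)
        common = set(a_chain) & set(b_chain)
        if not common:
            return None                      # the terms are not connected
        # Hops to the nearest common ancestor, summed over both sides.
        return min(a_chain.index(c) + b_chain.index(c) for c in common)

    def add_degree(candidates, parents):
        """Store each term's maximum connected path length to any other term."""
        for c in candidates:
            lengths = [path_length(c["tsn"], o["tsn"], parents)
                       for o in candidates if o["tsn"] != c["tsn"]]
            c["degree"] = max((l for l in lengths if l is not None), default=0)
        return candidates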

Disambiguate

  • For each candidate n-gram that is ambiguous (multiple ITIS terms match the original n-gram):
    • Score each matching term using a linear combination of ambiguity, match length, match preferred, match case, degree, and level (a sketch follows this list).
    • Select the top-scoring term.
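
A sketch of the disambiguation step. The weights below are placeholders; the actual coefficients of the linear combination are not given on this page.

    # Hypothetical weights; ambiguity is negative so ambiguous matches score lower.
    DISAMBIG_WEIGHTS = {"ambiguity": -1.0, "match_length": 1.0,
                        "match_preferred": 1.0, "match_case": 0.5,
                        "degree": 1.0, "term_level": 0.5}

    def linear_score(candidate, weights):
        """Weighted sum of candidate features (booleans count as 0/1)."""
        return sum(w * float(candidate[f]) for f, w in weights.items())

    def disambiguate(candidates):
        """For each ambiguous n-gram, keep only the highest-scoring ITIS term."""
        best = {}
        for c in candidates:
            key = c["ngram"]
            if key not in best or (linear_score(c, DISAMBIG_WEIGHTS)
                                   > linear_score(best[key], DISAMBIG_WEIGHTS)):
                best[key] = c
        return list(best.values())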

Rank disambiguated candidates

  • For each candidate, score based on a linear combination of the following features (a sketch follows this list):
    • Frequency
    • Match length
    • Match preferred
    • Match case
    • Degree
    • Level
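
The final ranking reuses the same linear-combination idea, swapping ambiguity for frequency; the weights are again illustrative assumptions, not the algorithm's actual coefficients.

    RANK_WEIGHTS = {"frequency": 1.0, "match_length": 1.0, "match_preferred": 1.0,
                    "match_case": 0.5, "degree": 1.0, "term_level": 0.5}

    def rank_candidates(candidates, k=10):
        """Sort disambiguated candidates by weighted feature sum; return top k."""
        def score(c):
            return sum(w * float(c[f]) for f, w in RANK_WEIGHTS.items())
        return sorted(candidates, key=score, reverse=True)[:k]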

Evaluation

  • Compare the final list of ranked terms to the “gold standard” test collections.
  • Calculate P@K, R@K, and F@K (a sketch follows this list):
  • Precision = (number of good terms returned) / (total number of terms returned)
  • Recall = (number of good terms returned) / (total number of good terms)
  • F-measure = 2 × (precision × recall) / (precision + recall)
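
A sketch of the metrics at a cutoff k, where gold is the set of "good" terms (e.g., the mapped ITIS TSNs) for one record:

    def metrics_at_k(returned, gold, k=10):
        """Compute P@k, R@k, and F@k for a single record."""
        top_k = returned[:k]
        good = sum(1 for t in top_k if t in gold)    # good terms returned
        precision = good / len(top_k) if top_k else 0.0
        recall = good / len(gold) if gold else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f_measure

    # Example: two of the top three returned terms are in a gold set of four,
    # so P@3 = 2/3, R@3 = 2/4, and F@3 = 2*(2/3)*(1/2)/(2/3 + 1/2) ≈ 0.571.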

Results

The following table presents the results of the automatic indexing algorithm described above on the three test collections with k=10. The automatic indexing process generally returns terms from the lower (more specific) ranks, i.e., Genus and Species. Expanding each term with its order and class results in recall@10 of 50.25% against BIOSIS indexing, 67.57% against PubMed indexing, and 84.84% against Dryad indexing.

                    BIOSIS                 PubMed                 Dryad
Config              P@10   R@10   F@10    P@10   R@10   F@10    P@10   R@10   F@10
Automatic           5.46   31.33  8.99    10.36  61.53  17.30   11.81  84.41  20.07
Auto+Order          6.71   44.08  11.02   10.65  66.59  17.92   11.98  85.27  20.36
Auto+Class+Order    7.45   50.25  12.28   10.77  67.57  18.09   11.90  84.84  20.21
(all values are percentages)

TBD: Results of Maui algorithm using same collections

TBD: Results of automatic indexing algorithm with NER (Neti Neti) input