HIVE/LC Web Archives Evaluation

Goal

Simultaneously optimize precision and recall for automated subject term suggestion for web archives using the Library of Congress Subject Headings (LCSH).

What is the best configuration for automatic indexing of web archives with LCSH?

Parameters: number of hops, differencing enabled/disabled, and Maui parameters.

Evaluation environment

  • hive.nescent.org
  • /home/craigwillis/minerva
    • harvest.sh: Wrapper around SimpleCrawler to create the 8 test collections
    • partition.sh: Wrapper around PartitionUtil to partition the 8 test collections into 10 test/training sets
    • build_models.sh: Wrapper around MauiModelBuilder to build models for each training set
    • extract_topics.sh: Wrapper around MauiTopicExtractor to extract topics and capture statistics
    • data/: Directory contains partitioned test collections from partition.sh, stopwords, and the LCSH SKOS vocabulary
    • input/: Directory contains the LC-provided spreadsheet
    • webarchives/: Directory contains 8 test collections from harvest.sh

Evaluation process

  1. Harvest content
  2. Partition test collection
  3. Build models
  4. Run test, capture output
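
These steps map directly onto the wrapper scripts listed under the evaluation environment. A minimal driver sketch that runs them in order is shown below; it assumes the scripts take no arguments, which may not match the actual drafts.

  #!/bin/bash
  # Hypothetical end-to-end driver for the evaluation pipeline.
  # Assumes the wrapper scripts take no arguments; the real drafts may
  # expect parameters or need to be run selectively.
  set -e
  cd /home/craigwillis/minerva

  ./harvest.sh          # 1. Harvest content into webarchives/
  ./partition.sh        # 2. Partition each collection into data/
  ./build_models.sh     # 3. Build one Maui model per training partition
  ./extract_topics.sh   # 4. Extract topics and log precision/recall/F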

Prepare data

LC has provided a spreadsheet “hive_test_training_sets_11182011.xls” that contains the URL for each test site and associated LCSH headings. Per email from LC, we will initially test indexing with the full LCSH vocabulary instead of the government policy subset.

One problem that has been identified is that the pre-indexed sites may include free-floating subdivisions, which will cause Maui to fail during lookup. We need to consider approaches to identifying and handling these cases.
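
One possible pre-check is sketched below, under the assumptions that the key files list one heading per line and that the SKOS vocabulary is an RDF/XML file under data/ (both assumptions): flag any assigned heading that does not appear as a label in the vocabulary.

  #!/bin/bash
  # Hypothetical pre-check: flag headings that will not resolve in LCSH,
  # e.g. pre-coordinated headings carrying free-floating subdivisions.
  # Key-file location/extension and vocabulary path are assumptions.
  VOCAB=data/lcsh.rdf
  for keyfile in webarchives/*/*.key; do
    while IFS= read -r heading; do
      [ -z "$heading" ] && continue
      if ! grep -qF ">${heading}<" "$VOCAB"; then
        echo "UNRESOLVED: $heading ($keyfile)"
      fi
    done < "$keyfile"
  done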

Harvest

Using the SimpleCrawler and test collection URL spreadsheet provided by LC, harvest content with the following parameters:

  • Differencing: enabled/disabled
  • Number of hops: 0, 1, 2, 3

There will be 8 different test collections with crawled text content and associated key files.

Draft implementation in harvest.sh
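
A rough sketch of the harvest loop is below. The SimpleCrawler arguments shown (spreadsheet input, output directory, hop count, differencing switch) are assumptions about its interface, not documented options.

  #!/bin/bash
  # Hypothetical harvest loop: 2 differencing settings x 4 hop counts
  # = 8 test collections under webarchives/.
  URLS=input/hive_test_training_sets_11182011.xls
  for diff in diff nodiff; do
    for hops in 0 1 2 3; do
      out=webarchives/${diff}-${hops}
      mkdir -p "$out"
      # Assumed SimpleCrawler invocation; the flags are illustrative only.
      java SimpleCrawler -input "$URLS" -output "$out" -hops "$hops" \
           $( [ "$diff" = "diff" ] && echo "-differencing" )
    done
  done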

Partition

Partition each test set into 10 different test/training sets:

java PartitionUtil input_dir output_dir num_partitions

Draft implementation in partition.sh
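
Given the PartitionUtil signature above, partition.sh can simply loop over the eight harvested collections. The directory names below follow the [diff, nodiff]-[num hops] convention used for model names later on; the exact layout is an assumption.

  #!/bin/bash
  # Hypothetical partition loop: split each of the 8 collections into
  # 10 test/training sets under data/.
  for diff in diff nodiff; do
    for hops in 0 1 2 3; do
      coll=${diff}-${hops}
      java PartitionUtil webarchives/${coll} data/${coll} 10
    done
  done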

Create models

For each partition, create models for the following conditions:

  • Minimum number of occurrences: 1, 2
  • Stopwords class: none, english
  • Stemmer class: none, maui.stemmers.SremovalStemmer, maui.stemmers.PorterStemmer, maui.stemmers.LovinsStemmer

Maui options:

java maui.main.MauiModelBuilder
  -l path to training set
  -m name of model
  -v name of vocabulary
  -f format of vocabulary (always skos)
  -i document language (always en)
  -e encoding (always utf-8)
  -w Wikipedia miner server (wikipediaDatabase@wikipediaServer)
  -d debugging
  -x maximum phrase length (constant for LCSH)
  -y minimum phrase length (constant for LCSH)
  -o minimum number of occurrences (1 or 2)
  -s name of stopwords class (either not specified or StopwordsEnglish)
  -t name of stemmer class (not specified, SremovalStemmer, PorterStemmer, LovinsStemmer)

For example: java maui.main.MauiModelBuilder -v lcsh -f skos -i en -e utf-8 [-w d@s] -x 5 -y 1 -o {1,2} -s {none, english} -t {none, sremoval, porter, lovins} -l /lcweb/diff-1/part0/train -m diff-1-0-1-none-sremoval

Where the model name is: [diff, nodiff]-[num hops]-[part no]-[min occur]-[stopword class]-[stemmer]

With 8 input collections, 10 partitions per collection, and 16 Maui configuration combinations (2 minimum-occurrence values x 2 stopword classes x 4 stemmer classes), there will be 8 x 10 x 16 = 1,280 distinct models. If we do not use cross-validation, there will be 8 input collections x 16 configuration combinations = 128 separate models.

Draft implementation in build_models.sh
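
A sketch of the build loop, using only the MauiModelBuilder options listed above. The training-set layout (data/<collection>/part<N>/train) and the exact class names passed to -s and -t are assumptions based on the example command and the option descriptions.

  #!/bin/bash
  # Hypothetical model-building loop: 8 collections x 10 partitions x
  # 16 Maui configurations = 1,280 models.
  for diff in diff nodiff; do
    for hops in 0 1 2 3; do
      for part in 0 1 2 3 4 5 6 7 8 9; do
        for occur in 1 2; do
          for stop in none english; do
            for stem in none sremoval porter lovins; do
              model=${diff}-${hops}-${part}-${occur}-${stop}-${stem}
              opts="-v lcsh -f skos -i en -e utf-8 -x 5 -y 1 -o $occur"
              # The class names below are assumed to be what -s/-t expect.
              [ "$stop" = "english" ] && opts="$opts -s StopwordsEnglish"
              case $stem in
                sremoval) opts="$opts -t SremovalStemmer" ;;
                porter)   opts="$opts -t PorterStemmer" ;;
                lovins)   opts="$opts -t LovinsStemmer" ;;
              esac
              java maui.main.MauiModelBuilder $opts \
                   -l data/${diff}-${hops}/part${part}/train -m "$model"
            done
          done
        done
      done
    done
  done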

Run test

For each model, run the Maui topic extractor and compare precision, recall, and F-measure. The only option that may vary across runs is the TFIDF flag.

java maui.main.MauiTopicExtractor
  -l path to test set
  -m name of model
  -v name of vocabulary (always lcsh)
  -f format of vocabulary (always skos)
  -e encoding (always utf-8)
  -w Wikipedia miner server (wikipediaDatabase@wikipediaServer)
  -i document language (always en)
  -n number of phrases (constant, 20)
  -t stemmer class (must match model)
  -s stopwords class (must match model)
  -d debugging
  -b TFIDF?

Running the MauiTopicExtractor with the debug flag will output the average precision, recall, and F-measure for each run. Collect these statistics in a log file for each run.

Draft implementation in extract_topics.sh
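
A sketch of the extraction loop is below. It assumes a hypothetical models.txt listing the model names produced by build_models.sh, and that each model's matching test partition sits at data/<collection>/part<N>/test (an assumption about the partition layout).

  #!/bin/bash
  # Hypothetical extraction loop: run MauiTopicExtractor in debug mode
  # for every model built by build_models.sh and keep one log per run.
  # Assumes models.txt lists one model name per line, e.g.
  # diff-1-0-1-none-sremoval (a hypothetical helper file).
  mkdir -p logs
  while IFS= read -r model; do
    coll=$(echo "$model" | cut -d- -f1-2)    # e.g. diff-1
    part=$(echo "$model" | cut -d- -f3)      # e.g. 0
    # -s and -t must be appended to match the stopwords and stemmer
    # classes used when the model was built.
    java maui.main.MauiTopicExtractor \
         -l data/${coll}/part${part}/test -m "$model" \
         -v lcsh -f skos -e utf-8 -i en -n 20 -d \
         > "logs/${model}.log" 2>&1
  done < models.txt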

Analyze results

Collect the parameters and statistics for each run and determine which configuration yields the highest precision, recall, and F-measure.
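
A small, assumed post-processing step: pull the F-measure out of each log and rank the runs. The log line format ("F-Measure: 0.42") is an assumption; adjust the pattern to match the actual MauiTopicExtractor debug output.

  #!/bin/bash
  # Hypothetical summary: rank runs by F-measure across all logs.
  # Assumes each log contains a line such as "F-Measure: 0.42".
  for log in logs/*.log; do
    f=$(grep -i "f-measure" "$log" | grep -oE '[0-9]+\.[0-9]+' | tail -1)
    printf '%s\t%s\n' "${f:-NA}" "$(basename "$log" .log)"
  done | sort -rn | head -20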