Week 5: Sonia Pascua – Project progress report

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading


I. Project update
  • Digitized 1910 LCSH was converted to Docx format by Peter
  • I was able to run the HIVE code on my local computer for code exploration
  • A sample db in HIVE is composed of 3 tables. Below is the LCSH db in HIVE
  • I was able to create the 1910 LCSH thesaurus for letter A, page 1, using MultiTes
  • I generated the HTML of the 1910 LCSH MultiTes thesaurus
  • I also generated the RDF/XML format of the thesaurus
  • I am evaluating solutions for the project:
    • How will the Docx format of the 1910 LCSH be converted to RDF automatically?
    • How will the Docx format of the 1910 LCSH be loaded into the HIVE DB automatically?
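One possible path for the Docx-to-RDF question is to export the Docx to plain text and parse the print-era cross-reference markers into SKOS. A minimal stdlib-only sketch, assuming one heading or reference per exported line and using "sa" (see also) and "x" (refer from) as markers; the URI base, `slug` scheme, and function names are invented for illustration:

```python
# Hypothetical sketch: convert plain-text lines exported from the 1910 LCSH
# Docx into minimal SKOS Turtle. Assumed markers: "sa" = see also
# (skos:related), "x" = refer from (skos:altLabel). URI base is illustrative.
BASE = "http://example.org/lcsh1910/"

def slug(label):
    # crude URI-safe identifier for a heading (illustrative only)
    return label.strip().lower().replace(" ", "_").replace(",", "")

def to_skos(lines):
    triples = []
    current = None
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if line.startswith("sa "):   # see-also reference -> skos:related
            triples.append(
                f"<{BASE}{current}> skos:related <{BASE}{slug(line[3:])}> .")
        elif line.startswith("x "):  # refer-from entry -> skos:altLabel
            triples.append(
                f'<{BASE}{current}> skos:altLabel "{line[2:].strip()}" .')
        else:                        # a main heading -> skos:Concept
            current = slug(line)
            triples.append(f"<{BASE}{current}> a skos:Concept ;")
            triples.append(f'    skos:prefLabel "{line.strip()}" .')
    return "\n".join(triples)

sample = ["Abandoned children", "sa Orphans", "x Foundlings"]
print(to_skos(sample))
```

The same parse could feed rdflib or MultiTes import files instead of raw Turtle strings; the point is that the conversion is mechanical once the markers are identified.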
II. Concerns / Issues / Risks
  • Deciding which solution to take, given the limited time
  • The SKOS implementation in HIVE supports only a subset of the standard SKOS elements
III. Pending action item
  • To explore MultiTes for automating the conversion of the 1910 LCSH Docx to RDF
  • To explore other tools for automating the conversion of the 1910 LCSH Docx to RDF
  • To explore the HIVE code for automating the loading of the 1910 LCSH Docx into the HIVE db
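For the loading question, the report mentions that a sample HIVE db has 3 tables, but their schema is not shown here. Purely as an illustration of what automated loading could look like, the sketch below assumes a three-table layout (concepts, labels, relations) and uses sqlite3 as a stand-in for HIVE's actual store:

```python
# Illustrative only: the real HIVE schema is not given in the report, so this
# assumes a hypothetical three-table layout. sqlite3 stands in for HIVE's db.
import sqlite3

def load_concepts(concepts):
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE concepts (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
        CREATE TABLE labels   (concept_id INTEGER, kind TEXT, value TEXT);
        CREATE TABLE relations(subject_id INTEGER, predicate TEXT, object_uri TEXT);
    """)
    for c in concepts:
        cur.execute("INSERT INTO concepts (uri) VALUES (?)", (c["uri"],))
        cid = cur.lastrowid
        cur.execute("INSERT INTO labels VALUES (?, 'prefLabel', ?)",
                    (cid, c["prefLabel"]))
        for alt in c.get("altLabels", []):
            cur.execute("INSERT INTO labels VALUES (?, 'altLabel', ?)", (cid, alt))
        for rel in c.get("related", []):
            cur.execute("INSERT INTO relations VALUES (?, 'skos:related', ?)",
                        (cid, rel))
    conn.commit()
    return conn

conn = load_concepts([
    {"uri": "http://example.org/lcsh1910/abandoned_children",
     "prefLabel": "Abandoned children",
     "altLabels": ["Foundlings"],
     "related": ["http://example.org/lcsh1910/orphans"]},
])
```

If the parsed Docx entries come out as dictionaries like the one above, the load step reduces to a loop of inserts, whatever the real HIVE tables turn out to be.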

Comments

  1. Sonia, thanks for the detailed information. I was interested in the HIVE application that you are using. I’m working on a project this summer to qualitatively analyze some text. It is quite a tedious task, and I wondered if running the text files through the indexing component of HIVE would help me identify some recurring themes. The text is focused on library education, so I selected the MeSH vocabulary and had a few matches. I think something like this, properly tweaked and developed, would be useful. In our case, we have three people working on the texts, so the standard definitions provided would help us all stay on the same page.

  2. Bridget, it may very well work for you, actually. As it happens, MeSH is already integrated into HIVE! Within the indexing portion of HIVE, you upload your document (or URL), select which controlled vocabularies you would like to use (in your case, MeSH), adjust any of the algorithm parameters as needed (if you are working with very short or very long documents…I can talk to you about this, if you would like), and then press “Index.” HIVE’s indexing function operates with two algorithms. The first is a keyword extraction algorithm (RAKE), which extracts natural language keywords based on frequency and co-occurrence. The second algorithm takes those natural language keywords and maps them to controlled vocabulary terms using wildcards and stemming. HIVE is best used as a semi-automated method, with human intervention to determine which terms are most relevant for you. If you click on a term in your results, you can format your results in various ways, including RDF SKOS.

    You may want to give it a try! http://hive2.cci.drexel.edu:8080/indexer
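The two-stage flow described in this comment can be roughly illustrated as follows. This is not HIVE's actual code: the stopword list, scoring, and fixed-prefix "stemming" are simplified assumptions standing in for RAKE and HIVE's wildcard/stemming matcher.

```python
# Rough illustration (not HIVE's code) of: (1) a RAKE-style pass that takes
# phrases between stopwords and scores them by word degree / word frequency,
# then (2) matching candidates to a controlled vocabulary by prefix stemming.
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "on", "with"}

def rake_candidates(text):
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:          # stopwords delimit candidate phrases
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)     # longer co-occurring phrases boost a word
    scored = [(" ".join(p), sum(degree[w] / freq[w] for w in p))
              for p in phrases]
    return sorted(scored, key=lambda t: -t[1])

def map_to_vocabulary(candidates, vocab, stem_len=6):
    # naive stand-in for stemming/wildcards: compare fixed-length word prefixes
    def stem(s):
        return " ".join(w[:stem_len] for w in s.lower().split())
    stems = {stem(term): term for term in vocab}
    return [stems[stem(c)] for c, _ in candidates if stem(c) in stems]

text = "Library education programs and the training of librarians"
vocab = ["Library Education", "Librarians"]
print(map_to_vocabulary(rake_candidates(text), vocab))
```

The human-in-the-loop step the comment recommends corresponds to reviewing the ranked candidate list before accepting the vocabulary matches.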
