LEADS Blog

Week 5: Sonia Pascua – Project progress report

July 17, 2019 (updated July 22, 2019) mrc_team

LEADS site: Digital Scholarship Center

Project title: SKOS of the 1910 Library of Congress Subject Heading

I. Project update

The digitized 1910 LCSH was converted to Docx format by Peter.

I was able to run the HIVE code on my local computer for code exploration.

A sample db in HIVE is composed of 3 tables. Below is the LCSH db in HIVE.

I was able to create the 1910 LCSH thesaurus for the letter A on page 1 using MultiTes.

I generated the HTML version of the 1910 LCSH MultiTes thesaurus.

I also generated the RDF/XML format of the thesaurus.

I am looking at possible solutions for the project:
- How will the Docx format of the 1910 LCSH be converted to RDF automatically?
- How will the Docx format of the 1910 LCSH be loaded into the HIVE db automatically?
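
One possible direction for the first question is sketched below in Python. It is only an illustration under stated assumptions, not the approach the project settled on. It assumes the headings have already been pulled out of the Docx file as plain text, with each heading on its own line and its relations on indented lines tagged BT, NT, or RT. The base URI, the function names, and the sample headings are all hypothetical placeholders and are not transcribed from the 1910 LCSH.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "http://example.org/lcsh1910/"  # hypothetical base URI, an assumption

# Common thesaurus relation codes mapped to SKOS property names.
RELATIONS = {"BT": "broader", "NT": "narrower", "RT": "related"}

# Placeholder input; these are NOT real 1910 LCSH headings.
SAMPLE = """\
Sample heading A
  BT Sample heading B
  RT Sample heading C
Sample heading B
  NT Sample heading A
"""


def slug(label):
    """Turn a heading label into a simple URI-safe identifier."""
    return "".join(c if c.isalnum() else "_" for c in label.strip())


def parse_entries(text):
    """Parse 'Heading' lines followed by indented 'BT/NT/RT target' lines."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a relation that belongs to the current heading.
            code, _, target = line.strip().partition(" ")
            if current is not None and code in RELATIONS and target.strip():
                entries[current].append((RELATIONS[code], target.strip()))
        else:
            # Unindented line: start of a new heading.
            current = line.strip()
            entries.setdefault(current, [])
    return entries


def to_skos_xml(entries):
    """Serialize the parsed entries as SKOS concepts in RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("skos", SKOS)
    root = ET.Element(f"{{{RDF}}}RDF")
    for label, relations in entries.items():
        concept = ET.SubElement(
            root, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": BASE + slug(label)}
        )
        ET.SubElement(concept, f"{{{SKOS}}}prefLabel").text = label
        for prop, target in relations:
            ET.SubElement(
                concept, f"{{{SKOS}}}{prop}", {f"{{{RDF}}}resource": BASE + slug(target)}
            )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_skos_xml(parse_entries(SAMPLE)))
```

The sketch uses only the Python standard library so that it runs anywhere. Extracting text from the Docx file and handling the actual reference conventions printed in the 1910 LCSH would need to be added on top of it.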

II. Concerns / Issues / Risks

Which solution to take, given the limited time
SKOS in HIVE has only a limited set of the standard SKOS elements

III. Pending action item

To explore MultiTes for automating the conversion of the 1910 LCSH Docx to RDF
To explore other tools for automating the conversion of the 1910 LCSH Docx to RDF
To explore the HIVE code for automating the loading of the 1910 LCSH Docx into the HIVE db

Published by mrc_team

3 thoughts on “Week 5: Sonia Pascua – Project progress report”

Bridget Disney says:

July 25, 2019 at 5:00 pm

Sonia, thanks for the detailed information. I was interested in the HIVE application that you are using. I’m working on a project this summer to qualitatively analyze some text. It is quite a tedious task, and I wondered if running the text files through the index component of HIVE would help me identify some recurring themes. The text is focused on library education, so I selected the MeSH vocabulary and had a few matches. I think something like this, if properly tweaked and developed, would be useful. In our case, we have three people working on the texts, so the standard definitions provided would help us all stay on the same page.

Sam Grabus says:

July 25, 2019 at 9:49 pm

Bridget, it may very well work for you, actually. As it so happens, MeSH is already integrated into HIVE! Within the indexing portion of HIVE, you upload your document (or URL), select which controlled vocabularies you would like to use (in your case, MeSH), adjust any of the algorithm parameters as need be (if you are working with very short or very long documents…I can talk to you about this, if you would like), and then press “Index.” HIVE’s indexing function operates with two algorithms. The first is a keyword extraction algorithm (RAKE), which will extract natural language keywords based on frequency and co-occurrences. The second algorithm will take those natural language keywords and then map them to controlled vocabulary terms using wildcards and stemming. HIVE is best used as a semi-automated method, with human intervention to determine which terms are most relevant for you. If you click on a term in your results, you can format your results in various ways, including RDF SKOS.

You may want to give it a try! http://hive2.cci.drexel.edu:8080/indexer

Bridget says:

August 16, 2019 at 12:44 pm

Thanks Sam! I ran the document through the LCSH and got some interesting results.
