LEADS Blog

Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
Technically, the main project output was completed this week: the SKOS of the 1910 LCSH in a machine-readable format, RDF/XML. However, integrating the 1910 LCSH vocabulary, now in RDF/XML, into HIVE for automatic indexing is also one of the goals of this project.
The last two weeks of the project will focus on parsing the SKOS elements and mapping them to the database fields of HIVE. The vocabulary terms will then be added to the database to build the LCSH db. Once the LCSH db is available, the SQL scripts and queries of HIVE should be able to retrieve the data and apply HIVE's indexing capabilities.
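The parsing and mapping step could look roughly like the sketch below. It is only an illustration under stated assumptions: the HIVE database schema is not described in this post, so the `concept` and `related` tables, their column names, the `example.org` URIs and the sample record are all hypothetical, and SQLite stands in for whatever database HIVE actually uses.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Made-up sample record; the real input would be the 1910 LCSH SKOS file.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/lcsh1910/abbeys">
    <skos:prefLabel xml:lang="en">Abbeys</skos:prefLabel>
    <skos:related rdf:resource="http://example.org/lcsh1910/monasteries"/>
    <skos:note>Illustrative note only.</skos:note>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/lcsh1910/monasteries">
    <skos:prefLabel xml:lang="en">Monasteries</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""


def parse_concepts(xml_text):
    """Extract Concept, prefLabel, related and note from SKOS RDF/XML."""
    root = ET.fromstring(xml_text)
    concepts = []
    for node in root.iter(f"{{{SKOS}}}Concept"):
        concepts.append({
            "uri": node.get(f"{{{RDF}}}about"),
            "pref_label": node.findtext(f"{{{SKOS}}}prefLabel", default="").strip(),
            "note": node.findtext(f"{{{SKOS}}}note", default="").strip(),
            "related": [r.get(f"{{{RDF}}}resource")
                        for r in node.findall(f"{{{SKOS}}}related")],
        })
    return concepts


def load_into_db(conn, concepts):
    """Write parsed concepts into two hypothetical tables (not HIVE's real schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS concept "
                 "(uri TEXT PRIMARY KEY, pref_label TEXT, note TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS related (uri TEXT, related_uri TEXT)")
    for c in concepts:
        conn.execute("INSERT OR REPLACE INTO concept VALUES (?, ?, ?)",
                     (c["uri"], c["pref_label"], c["note"]))
        conn.executemany("INSERT INTO related VALUES (?, ?)",
                         [(c["uri"], r) for r in c["related"]])
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_into_db(conn, parse_concepts(SAMPLE))
    for row in conn.execute("SELECT pref_label, uri FROM concept ORDER BY pref_label"):
        print(row)
```

Keeping the parsing separate from the database loading means the same `parse_concepts` output could be remapped once the real HIVE field names are known.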
See screenshot below of the 1910 LCSH SKOS.
Furthermore, below are the challenges that this project encountered:
  • Digitization – The TEI version of the 1910 LCSH turned out to be incomplete, so we need to go back to the print copies, digitize them again and redo the OCR process.
  • Encoding – Parsing, one of the activities in this project, ran into not only syntactic and basic semantic structure errors but also logic errors and errors in the interaction between syntax and semantics.
  • Programming
    • Characterizing the possible states and enumerating all of them, so that conditional statements can be composed.
    • The data is unclean, so patterns are hard to identify for logic formulation.
  • Digitalization – MultiTes or Python Program
    • Using MultiTes, which is a manual process but yields 98% accuracy in terms of representation.
    • Building a Python program to automate the SKOS creation from TEI format to RDF/XML format ran into pattern recognition challenges, because the irregularities introduced by the OCR process are hard to match with regular expressions. This yielded a higher percentage of errors, identified from the 47 inconsistencies found in the evaluation conducted when the control structures of the program were constructed. Further investigation could verify the percent error once the output is compared to the MultiTes version of the SKOS RDF/XML.
  • Metadata – The SKOS elements are limited to Concept, prefLabel, related and notes. altLabel, USE, USE FOR, BT and NT are not represented because the HIVE database has no provision for them.
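
To make the programming and digitalization challenges above more concrete, here is a minimal sketch of the general technique: classify each OCR line into one of a few enumerated states with tolerant regular expressions, emit SKOS RDF/XML for the lines that match, and set aside the rest for manual review. It is not the program built in this project. The input format (plain lines with "See also" and "Note:" markers), the sample lines, the `example.org` base URI and every function name are assumptions made for illustration, and the actual TEI input is structured differently.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/lcsh1910/"  # hypothetical base URI

# Tolerant patterns: OCR may drop the space in "See also" or change its case.
SEE_ALSO = re.compile(r"^see\s*also\s+(?P<targets>.+)$", re.IGNORECASE)
NOTE = re.compile(r"^note\s*:\s*(?P<text>.+)$", re.IGNORECASE)
# Assumed heading shape: starts with a capital letter and contains no digits.
HEADING = re.compile(r"^(?P<label>[A-Z][^\d]*?)\.?\s*$")

# Made-up OCR-like lines, including one too noisy to classify.
SAMPLE_LINES = [
    "Abbeys.",
    "See also Monasteries; Convents and nunneries.",
    "Seealso Priories",
    "Note: Illustrative note only.",
    "Ab8eys, Fr3nch",
]


def slug(label):
    """Turn a heading into a URI-safe fragment."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def classify(line):
    """Return (state, payload); 'unknown' marks a line that needs manual review."""
    line = line.strip()
    m = SEE_ALSO.match(line)
    if m:
        # Split on semicolons only, since headings themselves may contain commas.
        parts = [t.strip(" .") for t in m.group("targets").split(";")]
        return "related", [t for t in parts if t]
    m = NOTE.match(line)
    if m:
        return "note", m.group("text").strip()
    m = HEADING.match(line)
    if m:
        return "heading", m.group("label").strip()
    return "unknown", line


def build_skos(lines):
    """Build an rdf:RDF tree limited to Concept, prefLabel, related and note."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("skos", SKOS)
    root = ET.Element(f"{{{RDF}}}RDF")
    current, rejected = None, []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        state, payload = classify(line)
        if state == "heading":
            current = ET.SubElement(root, f"{{{SKOS}}}Concept",
                                    {f"{{{RDF}}}about": BASE + slug(payload)})
            ET.SubElement(current, f"{{{SKOS}}}prefLabel").text = payload
        elif state == "related" and current is not None:
            for target in payload:
                ET.SubElement(current, f"{{{SKOS}}}related",
                              {f"{{{RDF}}}resource": BASE + slug(target)})
        elif state == "note" and current is not None:
            ET.SubElement(current, f"{{{SKOS}}}note").text = payload
        else:
            rejected.append(line)  # unmatched or orphaned line
    return root, rejected


if __name__ == "__main__":
    tree, rejected = build_skos(SAMPLE_LINES)
    print(ET.tostring(tree, encoding="unicode"))
    print("Needs review:", rejected)
```

Collecting unmatched lines in `rejected`, rather than guessing at them, gives a direct count of the inconsistencies that could later be compared against the MultiTes version.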

The SKOS-ification of the 1910 LCSH brought a lot of challenges, which we documented to contribute to the case studies in digitization, encoding, programming, digitalization and metadata practices.
