Tag: Digital Scholarship Center at Temple University
Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format
- Digitization – The TEI version of the 1910 LCSH turned out to be incomplete, so we need to go back to the digitized print copies and redo the OCR process.
- Encoding – Parsing, one of the activities in this project, ran into not only syntactic and basic semantic structure errors but also errors in logic and in the interaction between syntax and semantics.
- Programming
- The states need to be characterized and, if possible, all enumerated so that a conditional statement can be composed.
- The data is so unclean that patterns are hard to identify for logic formulation.
- Digitalization – MultiTes or Python Program
- MultiTes is a manual process but yields 98% accuracy in terms of representation.
- Building a Python program to automate the SKOS creation from TEI format to RDF/XML format ran into pattern recognition challenges with regular expressions because of irregularities introduced by the OCR process. This yielded a higher percentage of error, identified from the 47 inconsistencies found in the evaluation conducted when the control structures of the program were constructed. Further investigation could verify the percent error once the output is compared with the MultiTes version of the SKOS RDF/XML.
- Metadata – SKOS elements are limited to Concept, PrefLabel, Related and Notes. AltLabel, USE, USE FOR, BT and NT are not represented because the HIVE database has no provision for them.
The SKOS-ification of the 1910 LCSH brought a lot of challenges, which we documented to contribute to the case studies in digitization, encoding, programming, digitalization and metadata practices.
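To make the limited element set concrete, here is a minimal sketch of writing Concept, PrefLabel, Related and Notes to SKOS RDF/XML with Python's standard library. It is not the project's actual program: the base URI, the record layout, the function names and the sample headings are all assumptions made up for illustration, and the records are assumed to have already been extracted from the TEI file.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_NS = "http://www.w3.org/XML/1998/namespace"
BASE_URI = "http://example.org/lcsh1910/"  # hypothetical base URI

# Ask ElementTree to write the conventional rdf: and skos: prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)


def concept_uri(label):
    """Build a placeholder URI by joining the words of a heading with underscores."""
    return BASE_URI + "_".join(label.split())


def build_skos(records):
    """Return an rdf:RDF element holding one skos:Concept per record."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for rec in records:
        concept = ET.SubElement(
            root,
            f"{{{SKOS_NS}}}Concept",
            {f"{{{RDF_NS}}}about": concept_uri(rec["prefLabel"])},
        )
        pref = ET.SubElement(
            concept, f"{{{SKOS_NS}}}prefLabel", {f"{{{XML_NS}}}lang": "en"}
        )
        pref.text = rec["prefLabel"]
        # Related headings are written as links to other concepts.
        for target in rec.get("related", []):
            ET.SubElement(
                concept,
                f"{{{SKOS_NS}}}related",
                {f"{{{RDF_NS}}}resource": concept_uri(target)},
            )
        # Anything else kept for the entry goes into plain notes.
        for note in rec.get("notes", []):
            ET.SubElement(concept, f"{{{SKOS_NS}}}note").text = note
    return root


# Made-up sample records, not real 1910 LCSH entries.
records = [
    {"prefLabel": "Sample heading A",
     "related": ["Sample heading B"],
     "notes": ["Sample note"]},
    {"prefLabel": "Sample heading B"},
]

xml_text = ET.tostring(build_skos(records), encoding="unicode")
print(xml_text)
```

The URI builder here only handles spaces; real headings with punctuation would need proper escaping before they can be used as identifiers.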
Week 6 – Sonia Pascua – Parser, Python & Mapping
Finally I met my mentor, Peter Logan, last Monday, and it was great to see him in person. In this meeting I presented the progress of the project and figured out that a TEI format would perhaps be a good data format for me to move forward with. As a pending action item, the TEI format will be generated and provided by Peter.
- I was able to write parser code in Python to extract the elements from the SKOS RDF/XML format of the 1910 LCSH.
- A joint assessment by Jane, Peter and me resulted in the following:
- There is an entry in LCSH that has multiple SEE terms. When it is converted to SKOS RDF/XML using MultiTes, only the first term is treated as PrefLabel and the rest fall into AltLabel. How SEE should be represented is a challenge. Based on LCSH, a concept with a SEE tag should use the SEE term as the subject heading. That is the case for the first term in the SEE tag, which became the PrefLabel. However, AltLabel is used as the tag for the succeeding SEE terms, and this is seen as an incorrect representation. Multiple PrefLabels are going to be explored. Can it be done? Wouldn’t it violate the LCSH or SKOS rules? I need to conduct further investigation on this.
- For now it is decided that USE will be transferred to AltLabel. We will set up a meeting with Joan, the developer of HIVE, to discuss how USE and USE FOR will be represented in HIVE.
- I brought up that some alphanumeric strings in the 1910 LCSH are recognized Library of Congress Classification numbers. Do they still need to be represented? As per Jane, they can be kept as Notes.
- I also need to investigate how BT and NT are going to be represented in both SKOS and the HIVE DB.
- The current SKOS RDF/XML at hand shows several SKOS elements, some of which have no representation in HIVE. To address this, we will bring this concern to Joan and consult with her on how these can be added or mapped to the existing HIVE DB fields.
- Since a text file is the input to the parser script I wrote, it is recommended to work on a text version of the 1910 LCSH. Peter will provide the TEI format.
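As a rough illustration of the parsing step and of the multiple-PrefLabel question, the sketch below reads SKOS RDF/XML and flags any concept that carries more than one prefLabel in the same language. The SKOS Reference's integrity conditions allow at most one prefLabel per language tag, so such concepts would need another representation. This is not the parser written for the project: the sample XML, the example.org URIs and the function names are invented for illustration, and the layout of the real MultiTes export may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
SKOS = "{http://www.w3.org/2004/02/skos/core#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample, standing in for a MultiTes RDF/XML export.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/lcsh1910/entry-1">
    <skos:prefLabel xml:lang="en">First SEE term</skos:prefLabel>
    <skos:altLabel xml:lang="en">Second SEE term</skos:altLabel>
    <skos:note>Sample note</skos:note>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/lcsh1910/entry-2">
    <skos:prefLabel xml:lang="en">Term one</skos:prefLabel>
    <skos:prefLabel xml:lang="en">Term two</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""


def parse_concepts(xml_text):
    """Extract URI, labels, related links and notes for every skos:Concept."""
    root = ET.fromstring(xml_text)
    concepts = []
    for node in root.iter(SKOS + "Concept"):
        concepts.append({
            "uri": node.get(RDF + "about"),
            "prefLabels": [(el.get(XML_LANG), el.text)
                           for el in node.findall(SKOS + "prefLabel")],
            "altLabels": [el.text for el in node.findall(SKOS + "altLabel")],
            "related": [el.get(RDF + "resource")
                        for el in node.findall(SKOS + "related")],
            "notes": [el.text for el in node.findall(SKOS + "note")],
        })
    return concepts


def concepts_with_multiple_pref_labels(concepts):
    """Return URIs of concepts with more than one prefLabel in one language."""
    flagged = []
    for concept in concepts:
        per_language = Counter(lang for lang, _ in concept["prefLabels"])
        if any(count > 1 for count in per_language.values()):
            flagged.append(concept["uri"])
    return flagged


concepts = parse_concepts(SAMPLE)
flagged = concepts_with_multiple_pref_labels(concepts)
print(flagged)
```

A check like this could be run on any candidate encoding of the SEE terms before deciding how they should be represented in HIVE.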
Additionally, earlier today, the LEADS-4-NDP 1-minute madness was held. I presented the progress of the project to my co-fellows and the LEADS-4-NDP advisory board.
Week 5: Sonia Pascua – Project progress report
- The digitized 1910 LCSH was converted to Docx format by Peter
- I was able to run the HIVE code on my local computer for code exploration
- A sample db in HIVE, such as the LCSH db, is composed of 3 tables
- I was able to create the 1910 LCSH thesaurus for letter A on page 1 using MultiTes
- I generated the HTML of the 1910 LCSH MultiTes thesaurus
- I also generated the RDF/XML format of the thesaurus
- I am looking for a solution for the project.
- How will the Docx format of the 1910 LCSH be converted to RDF automatically?
- How will the Docx format of the 1910 LCSH be loaded to the HIVE DB automatically?
- Which solution should we take given the limited time?
- SKOS in HIVE has only a limited subset of the elements of standard SKOS
- To explore MultiTes in the automation of converting the 1910 LCSH Docx to RDF
- To explore other tools in the automation of converting the 1910 LCSH Docx to RDF
- To explore the HIVE code in the automation of loading the 1910 LCSH Docx to the HIVE db
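One possible route for the automation questions above is to first export the Docx to plain text and then sort each line into a small set of states before any RDF is written. The sketch below is only an assumption about how that could look: the "See" and "See also" line prefixes, the semicolon delimiter, the function names and the sample lines are all hypothetical, and real OCR output is much noisier than this.

```python
import re

# Assumed line patterns; "see also" has to be tested before "see".
SEE_ALSO = re.compile(r"^see also\s+(.+)$", re.IGNORECASE)
SEE = re.compile(r"^see\s+(.+)$", re.IGNORECASE)


def classify(line):
    """Return a (state, text) pair for one line of the plain-text export."""
    stripped = line.strip()
    if not stripped:
        return ("blank", "")
    match = SEE_ALSO.match(stripped)
    if match:
        return ("see_also", match.group(1))
    match = SEE.match(stripped)
    if match:
        return ("see", match.group(1))
    # Everything else is treated as a new heading.
    return ("heading", stripped)


def group_entries(lines):
    """Attach each See / See also line to the heading that precedes it."""
    entries = []
    current = None
    for line in lines:
        state, text = classify(line)
        if state == "blank":
            continue
        if state == "heading":
            current = {"heading": text, "see": [], "see_also": []}
            entries.append(current)
        elif current is not None:
            # Multiple terms are assumed to be separated by semicolons.
            terms = [term.strip() for term in text.split(";") if term.strip()]
            current[state].extend(terms)
    return entries


# Made-up sample lines, not real 1910 LCSH entries.
sample_lines = [
    "Heading one",
    "See Term A; Term B",
    "",
    "Heading two",
    "See also Term C",
]
entries = group_entries(sample_lines)
print(entries)
```

A heading that itself begins with the word "See" would be misclassified by these patterns, which is the kind of case that has to be enumerated before the conditions can be trusted.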
Week 3-4: Sonia Pascua, The Paper and the proposal
In the past weeks, I was able to progress by co-authoring a paper with Jane Greenberg, Peter Logan and Joan Boone. We were able to submit the paper entitled “SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica” to NKOS 2019, which will be held at the Dublin Core Conference 2019 in South Korea on Sept 23-26, 2019. We can’t wait for the acceptance of the paper, and we hope that this research brings novelty to the field of Simple Knowledge Organization Systems (SKOS).
- The digitized format of the 1910 LCSH is converted to text format to help in the manipulation of texts and words. This has been done already by Peter. The 1910 LCSH in digitized format, which was made available by Google under the HathiTrust project, is composed of 2 volumes. In the text format (.docx), volume 1 is composed of 363 pages and volume 2 has 379 pages.
- Vocabularies are assessed to identify the structures and relationships of the vocabularies in the 1910 LCSH so that they can be mapped to the elements and syntax of the SKOS vocabulary. These elements and syntax have integrity conditions that are used as a guideline for best practices in constructing SKOS vocabularies.
- Processes, methods and methodology are documented and tested for reproducibility and replication purposes. The project will run for 10 weeks, and it is challenging to complete the SKOS-ification of the entire 2 volumes of the 1910 LCSH in that time. However, if the processes, tools, techniques and guides are available, the project could be continued and knowledge could be transferred to completely finish the SKOS of the 1910 LCSH.
- Tools to be used in building the SKOS of the 1910 LCSH and in automating its creation processes are seen as one of the vital outputs of this endeavor.
- Frazier, P. (2015, August 11). SKOS: A Guide for Information Professionals. Retrieved July 9, 2019, from http://www.ala.org/alcts/resources/z687/skos. Association for Library Collections and Technical Services, American Library Association
- HathiTrust: Home. (n.d.). Retrieved July 9, 2019, from www.hathitrust.org/. HathiTrust Digital Library
- Logan, P. (n.d.). Nineteenth-Century Knowledge Project. Retrieved July 9, 2019, from tu-plogan.github.io/. Digital Scholarship Center, Temple University
- SKOS Simple Knowledge Organization System – Home Page. (n.d.). Retrieved July 9, 2019, from https://www.w3.org/2004/02/skos/. Semantic Web Deployment Working Group, World Wide Web Consortium (W3C)
Week 2-3: Sonia Pascua, I am one of the “Mix” ‘s in the Metadata Mixer