LEADS Blog

Week 9 – Sonia Pascua – 1910 LCSH Database Schema

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
The next milestone in this project came when the 1910 LCSH concepts were loaded into a database. Below are screenshots of the CONCEPT table, whose records are the concepts of the 1910 LCSH. The resulting database, named “lchs1910.db”, has been added to the list of vocabulary databases in HIVE. The next steps are to formulate a test case, which Peter will provide, and to execute a query to check the results. We are also considering loading the created RDF or db into the live HIVE, and Joan Boone, the developer of HIVE, is assisting. I can’t wait to see the final output of the testing and the live 1910 LCSH.
Volume 1 – Database Schema Letters A-F
Volume 2 – Database Schema Letters G-P
Volume 3 – Database Schema Letters S-Z
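The load-and-query check described in this post can be sketched in Python with the standard sqlite3 module. This is only a minimal illustration: the real HIVE schema is not shown here, so the column names (id, pref_label), the sample rows and the helper names load_concepts and find_concepts are all assumptions, and an in-memory database stands in for “lchs1910.db”.

```python
import sqlite3

# Hypothetical sample rows; the real CONCEPT table holds the 1910 LCSH concepts.
SAMPLE_CONCEPTS = [
    ("Abandoned children",),
    ("Absorption",),
]


def load_concepts(conn, rows):
    """Create a simplified CONCEPT table (assumed columns) and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept ("
        "id INTEGER PRIMARY KEY, pref_label TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO concept (pref_label) VALUES (?)", rows)
    conn.commit()


def find_concepts(conn, prefix):
    """Return the preferred labels that start with the given prefix, sorted."""
    cur = conn.execute(
        "SELECT pref_label FROM concept "
        "WHERE pref_label LIKE ? ORDER BY pref_label",
        (prefix + "%",),
    )
    return [row[0] for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")  # stand-in for lchs1910.db
load_concepts(conn, SAMPLE_CONCEPTS)
print(find_concepts(conn, "Ab"))
```

A test case from Peter would replace the sample rows and the prefix query with real headings and the expected matches.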

Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
Technically, the project output was accomplished this week: the SKOS of the 1910 LCSH in a machine-readable format, RDF/XML. However, integrating the 1910 LCSH vocabulary, now in RDF/XML, into HIVE for automatic indexing is also one of the goals of this project.
The last two weeks of the project will focus on parsing the SKOS elements to map them to the database fields of HIVE. The vocabulary terms are then added to the database to build the LCSH db. Once the LCSH db is available, the SQL scripts and queries of HIVE should be able to retrieve the data and use the indexing capabilities of HIVE.
See the screenshot of the 1910 LCSH SKOS below.
Furthermore, below are the challenges that this project encountered:
  • Digitization – The TEI version of the 1910 LCSH was incomplete, so we needed to go back to the digitization of the print copies and redo the OCR process.
  • Encoding – Parsing, one of the activities in this project, encountered not only syntactic and basic semantic structure errors but also logic errors and syntax/semantics interaction errors.
  • Programming
    • Characterizing the states, if possible, and enumerating all of them so that a conditional statement can be composed.
    • The data is unclean, so patterns are hard to identify for logic formulation.
  • Digitalization – MultiTes or Python Program
    • MultiTes usage is a manual process but yields 98% accuracy in terms of representation.
    • Building a program (Python) to automate the SKOS creation from TEI format to RDF/XML format encountered pattern recognition challenges with regular expressions, brought about by the OCR process. This yielded a higher percentage of errors, identified from the 47 inconsistencies found in the evaluation conducted when the control structures of the program were constructed. Further investigation could verify the percent error once the output is compared to the MultiTes version of the SKOS RDF/XML.
  • Metadata – SKOS elements are limited to Concept, PrefLabel, Related and Notes. AltLabel, USE, USE FOR, BT and NT are not represented because the HIVE database has no provision for them.
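To make the limited element set concrete, here is a minimal sketch of writing SKOS RDF/XML with only Concept, prefLabel, related and note, using Python's standard xml.etree.ElementTree. It is not the program used in the project: the base URI, the entry dictionaries and the function names are hypothetical, and the sample headings are placeholders rather than real 1910 LCSH entries.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BASE_URI = "http://example.org/lcsh1910/"  # hypothetical base URI

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)

# Placeholder entries, not taken from the 1910 LCSH.
SAMPLE_ENTRIES = [
    {"label": "Sample heading A", "related": ["Sample heading B"],
     "notes": ["Placeholder note"]},
    {"label": "Sample heading B", "related": [], "notes": []},
]


def concept_uri(label):
    """Build a concept URI by percent-encoding the heading (assumed scheme)."""
    return BASE_URI + quote(label)


def build_skos(entries):
    """Serialize entries using only Concept, prefLabel, related and note."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for entry in entries:
        concept = ET.SubElement(
            root, f"{{{SKOS_NS}}}Concept",
            {f"{{{RDF_NS}}}about": concept_uri(entry["label"])},
        )
        pref = ET.SubElement(concept, f"{{{SKOS_NS}}}prefLabel", {XML_LANG: "en"})
        pref.text = entry["label"]
        for related in entry["related"]:
            ET.SubElement(
                concept, f"{{{SKOS_NS}}}related",
                {f"{{{RDF_NS}}}resource": concept_uri(related)},
            )
        for text in entry["notes"]:
            note = ET.SubElement(concept, f"{{{SKOS_NS}}}note")
            note.text = text
    return ET.tostring(root, encoding="unicode")


rdf_xml = build_skos(SAMPLE_ENTRIES)
print(rdf_xml)
```

If HIVE later gains fields for AltLabel, BT or NT, the same loop could emit skos:altLabel, skos:broader and skos:narrower in the same way.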

The SKOS-ification of the 1910 LCSH brought many challenges, which we documented to contribute to case studies in digitization, encoding, programming, digitalization and metadata practices.


Week 6 – Sonia Pascua – Parser, Python & Mapping

 

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

Finally I met my mentor, Peter Logan, last Monday, and it was great to see him in person. In this meeting I presented the progress of the project, and we figured out that a TEI format would be a good data format for me to move forward with. As a pending action item, Peter will generate and provide the TEI format.

Here are some of the matters to ponder in this project.
  • I was able to write a parser in Python to extract the elements from the SKOS RDF/XML format of the 1910 LCSH.
  • A joint assessment by Jane, Peter and me resulted in the following:
The sample entry from LCSH
 



SKOS RDF version


Assessment:
Concept : Abandoned children 
PrefLabel: first SEE instance 
USE: succeeding SEE instances – Foundlings & Orphans and orphan-asylums
    • There is an entry in LCSH that has multiple SEE terms. When it is converted to SKOS RDF/XML using MultiTes, only the first term is treated as the PrefLabel and the rest fall into AltLabel. How SEE should be represented is a challenge. In LCSH, a concept with a SEE tag should use the SEE term as the subject heading. That is what happens with the first term in the SEE tag: it becomes the PrefLabel. However, AltLabel is used for the succeeding SEE terms, which we consider an incorrect representation. Multiple PrefLabels are going to be explored. Can it be done? Wouldn’t it violate the LCSH or SKOS rules? I need to investigate this further.
    • It was decided for now that USE will be transferred to AltLabel. We will set up a meeting with Joan, the developer of HIVE, to discuss how USE and USE FOR will be represented in HIVE.
    • I raised the point that some alphanumeric strings in the 1910 LCSH are recognized Library of Congress Classification numbers. Do they still need to be represented? As per Jane, they can be kept as Notes.
    • I also need to investigate how BT and NT are going to be represented both in SKOS and in the HIVE DB.
    • The current SKOS RDF/XML at hand shows that some SKOS elements have no representation in HIVE. To address this, we will bring this concern to Joan and consult with her on how they can be added or mapped to the existing HIVE DB fields.
    • Since the parser script I wrote takes a text file as input, it is recommended to work on a text version of the 1910 LCSH. Peter will provide the TEI format.
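The extraction step and the multiple-PrefLabel question can be sketched with Python's standard xml.etree.ElementTree. This is an illustration, not the parser script itself: the concept URI is hypothetical, and the first SEE term is left as a placeholder because it is not given in the text above. The check reflects the SKOS Reference integrity condition that a resource has no more than one skos:prefLabel per language tag, which suggests that several PrefLabels in the same language would not be valid SKOS.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
SKOS = "{http://www.w3.org/2004/02/skos/core#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical rendering of the "Abandoned children" entry.
# FIRST SEE TERM is a placeholder for the first SEE term, not real data.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/lcsh1910/Abandoned%20children">
    <skos:prefLabel xml:lang="en">FIRST SEE TERM</skos:prefLabel>
    <skos:altLabel xml:lang="en">Foundlings</skos:altLabel>
    <skos:altLabel xml:lang="en">Orphans and orphan-asylums</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
"""


def extract_concepts(rdf_xml):
    """Return one dict per skos:Concept with its URI and labels."""
    root = ET.fromstring(rdf_xml)
    concepts = []
    for node in root.iter(SKOS + "Concept"):
        concepts.append({
            "uri": node.get(RDF + "about"),
            # Keep the language tag so duplicates can be checked per language.
            "pref_labels": [(e.get(XML_LANG), e.text)
                            for e in node.findall(SKOS + "prefLabel")],
            "alt_labels": [e.text for e in node.findall(SKOS + "altLabel")],
        })
    return concepts


def has_duplicate_pref_label(concept):
    """True if any language tag carries more than one prefLabel."""
    langs = [lang for lang, _ in concept["pref_labels"]]
    return len(langs) != len(set(langs))


concepts = extract_concepts(SAMPLE_RDF)
for concept in concepts:
    print(concept["uri"], concept["alt_labels"], has_duplicate_pref_label(concept))
```

Running the check over the MultiTes output would flag any concept where a second PrefLabel was introduced for the same language.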

Additionally, earlier today, the LEADS-4-NDP 1-minute madness was held. I presented the progress of the project to my co-fellows and the LEADS-4-NDP advisory board.

 


Week 5: Sonia Pascua – Project progress report

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

I. Project update
  • The digitized 1910 LCSH was converted to Docx format by Peter
  • I was able to run the HIVE code on my local computer for code exploration
  • A sample db in HIVE is composed of 3 tables. Below is the LCSH db in HIVE
  • I was able to create the 1910 LCSH thesaurus for letter A on page 1 using MultiTes
  • I generated the HTML of the 1910 LCSH MultiTes thesaurus

  • I also generated the RDF/XML format of the thesaurus
  • I am looking at possible solutions for the project.
    • How will the Docx format of the 1910 LCSH be converted to RDF automatically?
    • How will the Docx format of the 1910 LCSH be loaded to the HIVE DB automatically?
II. Concerns / Issues / Risks
  • Which solution to take given the limited time
  • SKOS in HIVE has only a limited set of the elements of standard SKOS
III. Pending action items
  • To explore MultiTes in the automation of converting the 1910 LCSH Docx to RDF
  • To explore other tools in the automation of converting the 1910 LCSH Docx to RDF
  • To explore the HIVE code in the automation of loading the 1910 LCSH Docx to the HIVE db

Week 3-4: Sonia Pascua, The Paper and the proposal

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

In the past weeks, I made progress by co-authoring a paper with Jane Greenberg, Peter Logan and Joan Boone. We were able to submit the paper, entitled “SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica”, to NKOS 2019, which will be held at the Dublin Core Conference 2019 in South Korea on Sept 23-26, 2019. We can’t wait for the paper’s acceptance, hoping that this research brings novelty to the field of Simple Knowledge Organization Systems (SKOS).

This paper was also the starting point for discussing what the approaches to the SKOS-ination of the 1910 LCSH could be.
This week I met with my mentor Peter for our weekly cadence. The scope was clarified and nailed down in this meeting. The project aims to transform the digitized 1910 LCSH to SKOS. Peter shared the text file of the digitized 1910 LCSH, and we discussed possible approaches for executing my task. I appreciated my mentor’s expertise in handling the project and a mentee like me. He made an effort to synchronize the concepts between us. We dwelled on the proper distinction between “keyword” and “index term”, which I believe is critical in building a thesaurus in SKOS. After I presented my plan of execution to him, these are the steps we looked at to achieve the goal of the project:
  • The digitized 1910 LCSH is converted to text format to help in the manipulation of texts and words. This has already been done by Peter. The digitized 1910 LCSH, made available by Google under the HathiTrust project, is composed of 2 volumes. In the text format (.docx), volume 1 has 363 pages and volume 2 has 379 pages.
  • The vocabularies are assessed to identify the structures and relationships in the 1910 LCSH so that they can be mapped to the elements and syntax of the SKOS vocabulary. These elements and syntax have integrity conditions that serve as guidelines for best practices in constructing SKOS vocabularies.
  • Processes, methods and methodology are documented and tested for reproducibility and replication. The project will run for 10 weeks, and completing the SKOS-ination of the entire 2 volumes of the 1910 LCSH in that time is challenging. However, if the processes, tools, techniques and guides are available, the project could be continued and the knowledge transferred to finish the SKOS of the 1910 LCSH.
  • The tools used in building the SKOS of the 1910 LCSH and in automating its creation are seen as one of the vital outputs of this endeavor.
For the moment, I have started reading the W3C Semantic Web and ALA guides to understand the methodologies and methods in constructing SKOS. In the search for tools, I will start by exploring MultiTes, for which the MRC has acquired a license.
My personal desire is not only to SKOSify the 1910 LCSH but also to document the process of finding the appropriate approach, techniques and tools, so that they can be used by and shared with not only the Digital Scholarship Center but also other entities with the same project goals and objectives. SKOS is a representation that is readily consumed on the web and allows vocabulary creators to publish born-digital vocabularies on the web. [Frazier, 2015].
References:
  1. Frazier, P. (2015, August 11). SKOS: A Guide for Information Professionals. Retrieved July 9, 2019, from http://www.ala.org/alcts/resources/z687/skos. Association for Library Collections and Technical Services, American Library Association
  2. HathiTrust: Home. (n.d.). Retrieved July 9, 2019, from www.hathitrust.org/. HathiTrust Digital Library
  3. Logan, P. (n.d.). Nineteenth-Century Knowledge Project. Retrieved July 9, 2019, from tu-plogan.github.io/. Digital Scholarship Center, Temple University
  4. SKOS Simple Knowledge Organization System – Home Page. (n.d.). Retrieved July 9, 2019, from https://www.w3.org/2004/02/skos/. Semantic Web Deployment Working Group, World Wide Web Consortium (W3C)

Week 2-3: Sonia Pascua, I am one of the “Mix” ‘s in the Metadata Mixer

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

On June 13, 2019, I presented our LEADS-4-NDP project at the Metadata Mixer.

I started my lightning talk by discussing the bigger picture of our project.

The Digital Scholarship Center has an ongoing project, the Nineteenth-Century Knowledge Project, which is building the most extensive open digital collections available today for studying the structure and transformation of 19th-century knowledge, using historic editions of the Encyclopedia Britannica. This project is making great progress towards establishing controlled vocabulary terms for metadata consistency and interoperability, and it uses vocabularies in HIVE, especially LCSH.

Our project centers on the SKOS-ination of the 1910 LCSH.

The hypothesis we would like to explore is that there may be a gap, which we call the “vocabulary divide”, between the vocabularies of the past and the present. HIVE currently holds the 2016 version of LCSH; we aim to add the 1910 version to serve research that uses resources from the past, especially 19th-century knowledge.

Above is our conceptual model. As shown, the 1910 LCSH would be digitized to text format for easy manipulation of words. Then from the text, be it in csv, xls or Docx format, the RDF/XML format is constructed for HIVE integration. Once the 1910 LCSH is in HIVE, it can be used for automatic indexing.
In the 5-min talk, I was able to present the proof of concept.
We formulated use cases based on the two data sets, the 1910 LCSH and the 2016 LCSH. Four scenarios were devised for data analysis. The gap or “vocabulary divide” is verified and validated by these use cases.
A simulation with the word “Absorption” was conducted. The article about the sun was taken from the 1911 Encyclopedia Britannica and subjected to text analysis using TagCrowd, which extracted the frequencies of the words in the article. For subject cataloging, which was done manually, descriptors were selected to represent the ABOUTNESS of the article; the 1910 LCSH was used for indexing and a vocabulary was generated. The same process was then executed with the 2016 LCSH in HIVE for automatic indexing. The case study fell under scenario 2, meaning that the word “Absorption” appeared in both data sets and thus existed from 1910 until 2016.
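The two steps of this simulation, counting word frequencies and checking a candidate term against both vocabularies, can be sketched in Python. This is only a toy stand-in: TagCrowd and HIVE were the tools actually used, the sample sentence is made up rather than quoted from the Britannica article, and the two tiny vocabulary sets and the function names are assumptions for illustration.

```python
import re
from collections import Counter

# Made-up sample text, not a quotation from the 1911 Encyclopedia Britannica.
SAMPLE_TEXT = (
    "Absorption lines appear in the solar spectrum. "
    "Absorption varies with wavelength."
)

# Toy vocabularies standing in for the 1910 and 2016 LCSH term lists.
VOCAB_1910 = {"absorption"}
VOCAB_2016 = {"absorption"}


def word_frequencies(text):
    """Count lowercase alphabetic words, similar in spirit to a tag cloud."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def presence(term, vocab_1910, vocab_2016):
    """Return (in 1910 vocabulary, in 2016 vocabulary) for a candidate term."""
    key = term.lower()
    return (key in vocab_1910, key in vocab_2016)


frequencies = word_frequencies(SAMPLE_TEXT)
print(frequencies["absorption"])
# (True, True) corresponds to the case described above as scenario 2:
# the term appears in both data sets.
print(presence("Absorption", VOCAB_1910, VOCAB_2016))
```

The other three scenarios would correspond to the remaining combinations of the two flags, numbered however the use cases define them.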

Week 1: Sonia Pascua, I am a LEADS-4-NDP 2019 Fellow

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
I am privileged to be one of the LEADS-4-NDP fellows for this year’s grant. My placement is with the Digital Scholarship Center of Temple University, and my mentor is Peter Logan. Currently, we are at the project proposal stage and establishing a proof of concept. We are also looking at a paper as one of our outputs, which we plan to submit to a conference like NKOS or Dublin Core.
As a fellow, I was included in the recent 3-day Data Science boot camp held at our university, Drexel University. As I posted on LinkedIn, I was really excited to learn and to meet my co-fellows in this boot camp. The days went by so quickly for this great endeavor. Nonetheless, I have a good account of my experience with this boot camp.
Day 1 was packed with lectures and with getting to know my co-fellows and our respective projects. Our ice breaker was fantastic. It gave us the opportunity to get to know the participants in a more fun way: we asked a partner a couple of questions and then presented what we had found to everyone in the room. It revealed exciting facts about my co-fellows and broke the stiffness among us. From that moment on I felt comfortable with everyone.
The lectures on Intro to Data Science by Prof. Erjia and Big Data Management by Prof. Il-Yeong, both from CCI, were inspiring, especially when they shared their own understanding of the concepts. I liked how Prof. Erjia started with “A hundred people will have a hundred definitions of Data science (DS)…”, which explained why the DS field is treated so differently by different people. I also liked how he drilled into the multidisciplinary skills needed by a modern data scientist and coached us to pick just one skill and be good at it. It would be hard to work on all four skillsets (Mathematics and Statistics, Programming and Databases, Domain Knowledge and Soft skills, Communications and Visualizations) and be a jack of all trades, which may leave you a master of none, and that is not fruitful for a career. For an academic researcher, it is advisable to excel at one skill and be a good part of a team in a DS endeavor. I appreciated Prof. Erjia’s list of biases, which I believe, if understood, could be key to overcoming the challenges encountered in DS.
On the other hand, Prof. Il-Yeong shared a comprehensive account of what has happened over time in the database field. His story of “Old SQL to NO SQL to New SQL” was awesome. It provided an understanding of what we have now. It was also great to have what I have been teaching validated by hearing about databases from an “antiqua” person. Don’t get me wrong. For me, the term “antiqua” is full of respect and admiration. In my 10 years of teaching databases, there are only a handful of people whom I regard as knowledgeable about the heart and soul of databases, and Il-Yeong is one of them.
The Data Science talk by one of the mentors, Dr. Jean Godby, a senior research scientist at OCLC, was precious. She offered a good perspective for understanding the challenges and promises of data science.
That day ended with our group dinner at Han Dynasty. We were joined by the Department Head of CCI at Drexel University, Dr. Xia, Dr. Michelle Rogers, and Dr. Peter Logan, one of the mentors of the LEADS-4-NDP project and the director of the Digital Scholarship Center, which is my placement.
Day 2 and day 3, I should say, were further stretches of lectures together with a workshop in R. We got our hands dirty with coding and built our technical skills in the basics of R. Topics ranged from data pre-processing, data visualization and visual analytics, data mining and machine learning II to text processing and a mini-workshop on BigML, a code-free tool for automated data analytics. Dr. Richard Marciano gave a short Data Science talk and presented the projects that he and the Digital Curation Innovation Center (DCIC) were working on. Additionally, Dr. Jane Greenberg delivered her presentation on metadata, data quality, and metadata integration.
I will miss the fellows. We did not get much time to really get to know each other, but at heart they are colleagues and cohorts whom I can work with in this research journey of my life. I wish all of us success in all our projects. I am looking forward to our virtual meetings, because we are all working over the summer but from different states. How I wish we had had time for bonding and trips.