LEADS Blog – Metadata Research Center

Karen Boyd, the 2018 LEADS Fellow who worked on the HSP project last year, and I presented our work at the LEADS Forum on January 24.
Here it is as a GIF: https://gph.is/g/a99OO36

Alyson Gamble, Karen Boyd, LEADS Presentation

And here it is as a PDF: http://shorturl.at/cty48
In case the GIF within a GIF didn’t work, here’s a quick example of editing in OpenRefine: https://gph.is/g/ZWdJ3yo

Karen and I will be presenting together again at Code4Lib. Our talk, “Cupper and Leecher, Tinman and Shrimp Fiend: Data Science Tools for Examining Historical Occupation Data,” will be held on March 9. You can read our abstract here: https://2020.code4lib.org/talks/Cupper-and-leecher-tinman-and-shrimp-fiend-Data-science-tools-for-examining-historical-occupation-data

I’ll post our Code4Lib slides after we give the talk.
LEADS has been a wonderful experience, and I’m glad to be able to talk about it to others. Hopefully the lessons from Karen’s and my experiences will inform another year of work on the HSP project. There’s a lot more to do, but the end results will be useful for a wide audience.
—————————————————-

Alyson Gamble Pronouns: They/Them/Theirs
Doctoral Student, Simmons University
www.mlisgamble.com

Final Post: Julaine Clunis Wrap Up

September 8, 2019 | January 21, 2020 | Sam Grabus

This has been quite an amazing experience for me and I am really very grateful for the opportunity.

As was noted in my previous posts my task was to find a method or approach for matching terms to similar terms in the primary vocabularies and making the terminology more consistent to support analytics.

I explored two methods for term matching.

Method 1

The first method used OpenRefine and its reconciliation services via the API of the focus vocabulary. A Python script matched terms in the DPLA dataset against terms from LCSH, LCNAF, and AAT. This method is very time-consuming: a small sample of the dataset, about 796,508 terms, took about 5-6 hours to process and returned only about 16% matching terms, all of them exact matches. The method can definitely be used to find exact matches, but testing should be done to ascertain whether the slow speed is due to the machine and connection specs of the testing machine. It did not prove useful for fuzzy matches: variant and compound terms were completely ignored unless they matched exactly. Below is an example of the results returned through the reconciliation process.

The scripts used for reconciliation are open source and freely available via GitHub and may be used and modified to suit the needs of the task at hand.
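As a rough illustration of what exact-match reconciliation involves, here is a minimal Python sketch. It is not the script used in the project. It assumes a service that follows the OpenRefine Reconciliation Service API batch format (a `queries` form field holding a JSON object keyed `q0`, `q1`, ...), and the response shown is hand-written in that shape rather than real output from LCSH, LCNAF, or AAT. The identifiers such as `example/1` are placeholders.

```python
import json


def build_queries(terms, type_id=None, limit=3):
    """Build the batch query object expected in the 'queries' form field."""
    queries = {}
    for i, term in enumerate(terms):
        query = {"query": term, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        queries[f"q{i}"] = query
    return queries


def exact_matches(terms, response):
    """Return {term: candidate id} for candidates the service flagged as a match."""
    matched = {}
    for i, term in enumerate(terms):
        for candidate in response.get(f"q{i}", {}).get("result", []):
            if candidate.get("match"):
                matched[term] = candidate["id"]
                break
    return matched


terms = ["kerosene lamps", "finance--law and legislation"]

# Form data that would be POSTed to the reconciliation endpoint.
payload = {"queries": json.dumps(build_queries(terms))}

# Hand-written placeholder response in the API's shape (NOT real service output).
sample_response = {
    "q0": {"result": [{"id": "example/1", "name": "Kerosene lamps",
                       "score": 100, "match": True}]},
    "q1": {"result": [{"id": "example/2", "name": "Finance",
                       "score": 40, "match": False}]},
}

print(exact_matches(terms, sample_response))
```

Sending terms in batches rather than one request per term is one thing worth testing when investigating the slow run times, since each round trip to a remote service adds latency.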

Method 2

The second method involved obtaining the data locally then constructing a workflow inside the Alteryx Data Analytics platform. To obtain the data, Apache Jena was used to convert the N-Triple files from the Library of Congress and the Getty into comma-separated values format for easy manipulation. These files could then be pulled into the workflow.
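The conversion itself was done with Apache Jena. For readers who want to see the shape of the task, the following is a simplified Python stand-in, not the Jena workflow. It assumes the labels of interest are stored under `skos:prefLabel` (an assumption; the vocabularies also expose other label properties) and only handles simple one-line triples with a literal object.

```python
import csv
import io
import re

# Assumed label predicate; the real files may use other properties as well.
PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Matches: <subject> <predicate> "literal"@lang .   (language tag or datatype optional)
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def labels_to_csv(ntriples_text, predicate=PREF_LABEL):
    """Write one CSV row (uri, label) per triple that uses the given predicate."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uri", "label"])
    for line in ntriples_text.splitlines():
        m = TRIPLE.match(line.strip())
        if m and m.group(2) == predicate:
            writer.writerow([m.group(1), m.group(3)])
    return out.getvalue()


# Placeholder triples with example.org URIs, not real vocabulary data.
sample = '''<http://example.org/c1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Kerosene lamps"@en .
<http://example.org/c1> <http://www.w3.org/2004/02/skos/core#broader> <http://example.org/c0> .
'''

print(labels_to_csv(sample))
```

This sketch does not decode escape sequences inside literals, which is one reason a full RDF toolkit such as Jena is the safer choice for the complete files.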

The first step was data preparation and cleaning: removing leading and trailing spaces, converting all the labels to lowercase, and removing extraneous characters. We then added unique identifiers and source labels to the data to be used later in the process. The data was then joined on the label field to obtain exact matches. This process returned more exact matches than the previous method with the same data, and even with the full (not sample) dataset, the entire process took a little under 5 minutes. The data that did not match was then processed through a fuzzy match tool, where algorithms such as key match, Levenshtein, Jaro, or combinations of these can be used to find non-exact matches.

Each algorithm returns different results, and more study is needed to determine which method, or which combination, yields the best and most consistent results.
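The cleaning and exact-join steps were built in Alteryx, but the same idea can be sketched in a few lines of Python. This is only an illustration: which characters count as "extraneous" is my assumption here, and the identifiers, source labels, and terms are placeholders.

```python
import re


def normalize(label):
    """Trim, lowercase, drop punctuation other than commas and hyphens, collapse spaces."""
    label = label.strip().lower()
    label = re.sub(r"[^\w\s,-]", "", label)  # assumed definition of "extraneous"
    return re.sub(r"\s+", " ", label)


def exact_join(dataset_terms, vocab_rows):
    """Join dataset terms to vocabulary rows on the normalized label.

    vocab_rows holds (unique_id, source, label) tuples.
    Returns (matched, unmatched); unmatched terms would go on to fuzzy matching.
    """
    index = {}
    for uid, source, label in vocab_rows:
        index.setdefault(normalize(label), []).append((uid, source))

    matched, unmatched = {}, []
    for term in dataset_terms:
        key = normalize(term)
        if key in index:
            matched[term] = index[key]
        else:
            unmatched.append(term)
    return matched, unmatched


# Placeholder rows; the ids and labels are illustrative only.
vocab = [("v1", "LCSH", "Kerosene lamps"), ("v2", "AAT", "lamps (lighting devices)")]
terms = ["  Kerosene Lamps. ", "Lamp--Kerosene"]

matched, unmatched = exact_join(terms, vocab)
print(matched)
print(unmatched)
```

Building a dictionary index once and looking each term up in it is what makes a local join fast compared with one remote request per term.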

What is true of all of the algorithms, though, is that a match score lower than 85% seems to result in matches that are not quite correct, with correct matches interspersed. Even high match scores using the character Levenshtein algorithm display this problem, with LCSH compound terms in particular. For example, [finance–law and legislation] is being shown as a match with [finance–law and legislation–peru]. While these are similar, should they be considered any kind of match for the purposes of this exercise? If so, how should the match be described?
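To make the compound-term problem concrete, here is a small Python sketch of character-level Levenshtein distance with one common way of turning it into a similarity score (one minus the distance divided by the longer length). This normalization is my assumption; the fuzzy match tool in Alteryx may compute its scores differently, so the numbers here are not expected to reproduce its output.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


short = "finance--law and legislation"
long_ = "finance--law and legislation--peru"

print(levenshtein(short, long_))
print(round(similarity(short, long_), 3))
```

Because the longer heading simply appends a subdivision, the distance is only the length of the added suffix, so the score stays high even though the two headings name different concepts. A rule that compares subdivisions separately might describe this as a "broader heading" relationship rather than a match.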

Character Levenshtein

Character Levenshtein

Still, despite the problems, trying various algorithms and varying the match thresholds returns many more matches than exact matching alone. This method also seems useful for matching terms that use the LCSH compound term style with close matches in AAT. Below are some examples of results.

Character: Best of Levenshtein & Jaro

Word: Best of Levenshtein & Jaro

In the second image, we can look at the example with kerosene lamps. In the DPLA data, it seems to have been labeled using the LCSH format as [lamp–kerosene], but the algorithm is showing it is a close match with the term [lamp, kerosene] in AAT.
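One way to see why a word-level comparison catches the kerosene lamp case is to compare the sets of words in each label and ignore the punctuation between them. The sketch below uses Jaccard overlap of word tokens, which is a simpler technique than the "Word: Best of Levenshtein & Jaro" setting used above; it is offered only as an illustration.

```python
import re


def tokens(label):
    """Lowercased alphanumeric word tokens; punctuation and separators are ignored."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def word_overlap(a, b):
    """Jaccard overlap of the two token sets (1.0 means identical word sets)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


print(word_overlap("lamp--kerosene", "lamp, kerosene"))
print(word_overlap("finance--law and legislation",
                   "finance--law and legislation--peru"))
```

The first pair shares every word, while the second differs by one word out of five, which gives a cleaner signal for compound headings than a character-level score.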

The results from these algorithms need to be studied and refined more so that the best results can be obtained consistently. I hope to be able to look more in-depth at these results for a paper or conference at some point and come up with a recommended usable workflow.

This is where I was at the end of the ten weeks and I am hoping to find time to look deeper at this problem. I welcome any comments or thoughts and again want to say how grateful I am for the opportunity to work on this project.

—

Julaine Clunis

LEADS Blog

Final post: Kai Li: Wrapping up of OCLC project

September 7, 2019 | January 21, 2020 | Sam Grabus

It’s been a really fast 10 weeks working with OCLC. While I missed quite a few blog posts, the work never stopped. This post will only serve as a summary of this project. I will write a few posts (potentially on my personal website) about more details of this project and some technical backgrounds behind the selections that we made.

In this project, we tried to apply network analysis and community detection methods to identify meaningful publisher clusters based on the ISBN publisher codes they use. From my perspective, this unsupervised learning approach was selected because no large-scale baseline test exists, so a supervised approach using real-world data is not possible.

In the end, we get yearly publisher clusters that hopefully reflect the relationships between publishers in a given year. That being said, community detection methods are difficult to combine with temporal considerations. The year may not be a fully meaningful unit for analyzing how publishers are connected to each other (the relationship between any two publishers may well change in the middle of a given year), but we still hope this approach to publisher clusters can generate more granular results than using data from all years. The next step, which turned out to be much more substantial than expected, is to evaluate the results manually. And hopefully this project will be published in the near future.
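To give a sense of what "yearly clusters" means in practice, here is a deliberately crude Python sketch. It is not the method used in the project: instead of a community detection algorithm, it simply groups publishers that share an ISBN publisher code within the same year (connected components via union-find). The record layout and the sample values are hypothetical.

```python
from collections import defaultdict


def _components(rows):
    """Group publishers linked through shared codes, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for publisher, code in rows:
        root_pub, root_code = find(("pub", publisher)), find(("code", code))
        parent[root_pub] = root_code

    groups = defaultdict(set)
    for publisher, _ in rows:
        groups[find(("pub", publisher))].add(publisher)
    return sorted(sorted(group) for group in groups.values())


def yearly_clusters(records):
    """records: iterable of (year, publisher_name, isbn_publisher_code)."""
    by_year = defaultdict(list)
    for year, publisher, code in records:
        by_year[year].append((publisher, code))
    return {year: _components(rows) for year, rows in by_year.items()}


# Hypothetical records: publisher names and codes are placeholders.
records = [
    (2000, "A", "111"), (2000, "B", "111"), (2000, "C", "222"),
    (2001, "A", "111"), (2001, "C", "111"),
]
print(yearly_clusters(records))
```

Splitting the data by year before clustering is what makes the results year-specific, and it also shows the limitation noted above: a relationship that changes mid-year is assigned wholly to one year's cluster.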

Despite its limitations, I really learned a lot from this project. This is the first time I have worked with library metadata on a really large scale. As this was almost my first project too large to be handled in R, I gained extensive experience using Python to process XML data. And during the process, I also read a lot about the publishing industry, whose relationship with our project proved to be more than significant.

The last point above is also one that I wish I had better realized at the beginning of this project. The most challenging part of this project is not any technical issue, but the complexity of the reality that we aim to understand through data analysis. Publishers and imprints can mean very different things in different social and data contexts. And there are different approaches to clustering them, each with its own meaning underlying the clusters. My lack of appreciation of the importance of the real publishing industry prevented me from foreseeing the difficulties of evaluating the results. I think, in a way, this could mean that field knowledge is fundamental to any algorithmic understanding of this topic (or other topics data scientists have to work on), and that any automatic method is only secondary to the final solution to this question.

Week – 9 Sonia Pascua – 1910 LCSH Database Schema

August 30, 2019 | September 6, 2019 | mrc_team

LEADS site: Digital Scholarship Center

Project title: SKOS of the 1910 Library of Congress Subject Heading

The next milestone in this project was reached when the 1910 LCSH concepts were loaded into a database. Below are screenshots of the CONCEPT table, whose records are the concepts of the 1910 LCSH. The resulting database, named “lchs1910.db”, has been added to the list of vocabulary databases in HIVE. The next steps are to formulate a test case, which Peter will provide, and to execute a query to check the results. We are also considering loading the created RDF or db into the live HIVE, and Joan Boone, the developer of HIVE, is assisting. I can’t wait to see the final output of the testing and the live 1910 LCSH.

Volume 1 – Database Schema Letters A-F

Volume 2 – Database Schema Letters G-P

Volume 3 – Database Schema Letters S-Z
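For readers without access to the screenshots, the following Python sketch shows the general idea of loading concepts into a table and querying them back. It is a guess at the shape of the task, not HIVE's actual schema: the column names are hypothetical, the rows are placeholders rather than real 1910 LCSH headings, and an in-memory SQLite database stands in for the project's database file.

```python
import sqlite3

# Placeholder rows (id, preferred label, note); not real 1910 LCSH data.
concepts = [
    ("c1", "Example heading one", "Placeholder note"),
    ("c2", "Example heading two", None),
]

conn = sqlite3.connect(":memory:")  # the project used a database file instead
conn.execute(
    "CREATE TABLE concept ("
    " id TEXT PRIMARY KEY,"
    " pref_label TEXT NOT NULL,"
    " note TEXT)"  # hypothetical columns, not HIVE's schema
)
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", concepts)
conn.commit()

# A simple check query of the kind a test case might run.
rows = conn.execute(
    "SELECT id, pref_label FROM concept WHERE pref_label = ?",
    ("Example heading two",),
).fetchall()
print(rows)

count = conn.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
print(count)
```

Comparing the row count against the number of concepts in the source file is a quick sanity check before running more specific test queries.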

Rongqian Ma; Week 8-10

August 20, 2019 | August 30, 2019 | Sam Grabus

Week 8-10: Re-organizing place and date information. Based on the problems that appeared in the current version of the visualizations, I performed another round of data cleaning and modification, especially for the date and geography information. With the goal of reducing the number of categories in each visualization, I merged more of the data into broader categories. For example, all the city information was merged into countries, and single dates (e.g., 1470) were merged into the corresponding time period (the year 1470, for instance, was merged into the 1450-1475 period). I also resolved further inconsistencies in the data across the time and geography categories. As the following example shows, the new version of the visualizations is cleaner in terms of the number of categories and more readable. Over the last couple of weeks, I have also discussed the visualizations and the problems I had with my mentor, and worked with my mentor on the data merge. I’m also working on a potential poster submission to iConference 2020.
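The merging rules described above can be sketched in Python as follows. This is an illustration only: it assumes 25-year periods aligned so that 1470 falls in 1450-1475, assumes a boundary year such as 1475 belongs to the later period, and uses a tiny hypothetical city-to-country lookup rather than the project's actual mapping.

```python
def period_for(year, width=25):
    """Return the period label containing the year, e.g. 1470 -> '1450-1475'."""
    start = (year // width) * width
    return f"{start}-{start + width}"


# Hypothetical lookup; the real dataset's cities and spellings may differ.
CITY_TO_COUNTRY = {"Paris": "France", "Florence": "Italy"}


def merge_place(place):
    """Map a city to its country; leave values that are not in the lookup unchanged."""
    return CITY_TO_COUNTRY.get(place, place)


print(period_for(1470))
print(merge_place("Paris"))
print(merge_place("France"))
```

Leaving unknown places unchanged keeps the merge safe to rerun: rows that already hold a country pass through untouched.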

Example:

Rongqian Ma; Week 6-7: Exploring Timeline JS for the Stories of Book of Hours

August 20, 2019 | January 21, 2020 | Sam Grabus

Week 6-7: Exploring Timeline JS for the story of Book of Hours. I spent the past two weeks designing and creating a timeline of the evolution of the book of hours using the Timeline JS visualization site. The timeline incorporates as much of the available information in the dataset as possible, including the date information, locations, digital images, and some textual descriptions. Timeline JS is an effective storytelling platform that combines multimedia resources and information. I initially started exploring this tool while visualizing the date information of the dataset; I wanted to find a form of visualization that could examine the relationships between different categories of the dataset, especially those among the temporal, geographical, and content information of the manuscripts. I was able to create a timeline that demonstrates the evolution of book of hours manuscripts from the 14th to the 16th centuries, and to develop a multimedia narrative of the book of hours. The biggest challenge in creating the timeline is generating reasonable and meaningful period intervals. Because the date information in the original dataset is presented largely as text and descriptions (e.g., circa 1460, mid-15th century, 1450-1460), reformatting it into an easily computed form is important. After that, the main work in deciding on the intervals is to summarize the characteristics of each representative period and present them in the timeline. Creating the timeline also entails reviewing relevant literature on the book of hours, choosing the pictorial representations, and illustrating the characteristics of the book of hours for each time period based on other categories of information (e.g., geolocations, decorations).
Despite the advantages of Timeline JS as an effective tool, it appears to be more a platform for “display of findings and results” than an approach for “visual analysis.” [Based on discussions with my mentor, she is going to help with the textual descriptions of the book of hours in general and of each section, which I really appreciate!]
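As an illustration of turning textual dates into an easily computed form, here is a small Python sketch that converts the three kinds of date expressions mentioned above into numeric year ranges. The rules are my assumptions, not the ones used in the project: "circa" is read as plus or minus ten years, a century is treated as running from 00 to 99, and "early", "mid", and "late" split a century roughly into thirds.

```python
import re

# Assumed offsets within a century for each qualifier.
CENTURY_PARTS = {None: (0, 99), "early": (0, 33), "mid": (34, 66), "late": (67, 99)}


def date_range(text, circa_margin=10):
    """Return (start_year, end_year) for a textual date, or None if unrecognized."""
    t = text.strip().lower().replace("^", "")

    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", t)  # e.g. 1450-1460
    if m:
        return int(m.group(1)), int(m.group(2))

    m = re.fullmatch(r"(?:circa|ca\.?|c\.)\s*(\d{4})", t)  # e.g. circa 1460
    if m:
        year = int(m.group(1))
        return year - circa_margin, year + circa_margin

    m = re.fullmatch(r"(?:(early|mid|late)[-\s])?(\d{1,2})(?:st|nd|rd|th)\s+century", t)
    if m:  # e.g. mid-15th century
        start = (int(m.group(2)) - 1) * 100
        low, high = CENTURY_PARTS[m.group(1)]
        return start + low, start + high

    m = re.fullmatch(r"\d{4}", t)  # a single year
    if m:
        return int(t), int(t)

    return None


for example in ["circa 1460", "mid-15th century", "1450-1460"]:
    print(example, "->", date_range(example))
```

Returning a range rather than a single year keeps the uncertainty of the original description, and returning None for unrecognized text makes it easy to list the entries that still need manual review.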

Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format

August 20, 2019 | January 21, 2020 | mrc_team

LEADS site: Digital Scholarship Center

Project title: SKOS of the 1910 Library of Congress Subject Heading

Technically, the project output was accomplished this week: the SKOS of the 1910 LCSH in a machine-readable format, RDF/XML. However, integrating the 1910 LCSH vocabulary, now in RDF/XML, into HIVE for use in automatic indexing is also one of the goals of this project.

The last two weeks of the project will focus on parsing the SKOS elements and mapping them to the database fields of HIVE. The vocabulary is then added to the database to build the LCSH db. Once the LCSH db is available, the SQL scripts and queries of HIVE should be able to retrieve the data and use the indexing capabilities of HIVE.
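The parsing step can be pictured with the short Python sketch below, which reads SKOS concepts from RDF/XML and flattens them into rows that could be mapped to database fields. It is not HIVE's loader. The field names are hypothetical, I am assuming the notes are encoded as `skos:note`, and the sample document uses example.org URIs with placeholder text.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF = "{%s}" % NS["rdf"]


def concept_rows(rdf_xml):
    """Flatten each skos:Concept into a dict with hypothetical field names."""
    root = ET.fromstring(rdf_xml)
    rows = []
    for concept in root.findall("skos:Concept", NS):
        rows.append({
            "uri": concept.get(RDF + "about"),
            "pref_label": concept.findtext("skos:prefLabel", default="", namespaces=NS),
            "related": [r.get(RDF + "resource") for r in concept.findall("skos:related", NS)],
            "notes": [n.text or "" for n in concept.findall("skos:note", NS)],
        })
    return rows


# Placeholder document; not taken from the 1910 LCSH output.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/lcsh1910/c1">
    <skos:prefLabel xml:lang="en">Example heading</skos:prefLabel>
    <skos:related rdf:resource="http://example.org/lcsh1910/c2"/>
    <skos:note>Placeholder note</skos:note>
  </skos:Concept>
</rdf:RDF>"""

for row in concept_rows(sample):
    print(row)
```

Keeping `related` and `notes` as lists reflects that a concept can carry several of each, so they would likely map to separate rows or a joined text field depending on the target schema.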

See screenshot below of the 1910 LCSH SKOS.

Furthermore, below are the challenges that this project encountered:

Digitization – The TEI version of the 1910 LCSH turned out to be incomplete, so we need to go back to digitizing the print copies and redo the OCR process.
Encoding – Parsing, one of the activities in this project, encountered not only syntactic and basic semantic structure errors but also errors in logic and in the interaction between syntax and semantics.
Programming
- Characterizing the states, and if possible enumerating all of them, so that a conditional statement can be composed.
- The data is so unclean that patterns are hard to identify for formulating the logic.
Digitalization – MultiTes or Python program
- Using MultiTes is a manual process, but it yields 98% accuracy in terms of representation.
- Building a Python program to automate SKOS creation from TEI format to RDF/XML format encountered pattern recognition challenges with regular expressions, brought on by the OCR process. This yielded a higher percentage of errors, identified from the 47 inconsistencies found in the evaluation conducted when the control structures of the program were constructed. Further investigation could verify the percent error once the output is compared to the MultiTes version of the SKOS RDF/XML.
Metadata – SKOS elements are limited to Concept, PrefLabel, Related and Notes. AltLabel, USE, USE FOR, BT and NT are not represented because the HIVE database has no provision for them.

The SKOS-ification of the 1910 LCSH brought a lot of challenges, which we documented as a contribution to the case studies in digitization, encoding, programming, digitalization and metadata practices.

Alyson Gamble, Week 5: Historical Society of Pennsylvania

August 16, 2019 | August 19, 2019 | Sam Grabus

We had a one-minute presentation with the Advisory Board. I wanted to share my slide, which covers some of the highlights of the project thus far.

There was audio to accompany the slide. If you’d like to listen to it, please let me know.

—

Alyson Gamble

Doctoral Student, Simmons University

Bridget Disney, California Digital Library – YAMZ

August 16, 2019 | August 19, 2019 | Sam Grabus

California Digital Library – YAMZ

Bridget Disney

We have been duplicating our setup for the local instance of YAMZ on the Amazon AWS server. The process is similar – kind of – and we’ve come across and worked through some major glitches in its setup.

One challenge that we have experienced is setting up the database. First we had to figure out where PostgreSQL was installed. The address is specified in the code, but it had moved to a different location on the new server. The code goes through several steps to determine which database to use (local or remote), and the rules have changed on the new system. Because of that, we have had to figure out our new environments and our permissions, documenting the process as we go along. We’ve set up a markdown file in GitHub which will be the final destination for our process documentation, but in the meantime, we made entries to a file in Google Docs as we worked through the process of the AWS installation.

Finally, we used pg_dump/pg_restore to move the data from the old to the new PostgreSQL database, so now we have over 2500 records and a functioning website on Amazon AWS! This has been a long time coming, but it has helped me see the purpose of the whole project, which is to allow people to enter terms and then collaborate to determine which of those terms will become standard in different environments. In order for this to happen, this system will have to be used frequently and consistently over time.
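Since documenting the process was one of our concerns, here is a small Python sketch of how the dump and restore commands could be recorded in a repeatable form. It only builds and prints the command lines; it does not run them. The database name, host names, user name, and file name are placeholders, not our actual settings, and the exact options we used may have differed.

```python
import shlex


def dump_command(dbname, outfile, host="localhost", user="postgres"):
    """Build a pg_dump command that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", f"--host={host}",
            f"--username={user}", f"--file={outfile}", dbname]


def restore_command(dbname, infile, host, user="postgres"):
    """Build a pg_restore command that loads the archive into an existing database."""
    return ["pg_restore", "--no-owner", f"--host={host}",
            f"--username={user}", f"--dbname={dbname}", infile]


# Placeholder names: "yamz", "old-server" and "new-server" are illustrative only.
dump = dump_command("yamz", "yamz.dump", host="old-server")
restore = restore_command("yamz", "yamz.dump", host="new-server")

print(shlex.join(dump))
print(shlex.join(restore))

# To actually run a command, something like
#   subprocess.run(dump, check=True)
# could be used once the connection settings are confirmed.
```

The custom archive format is the one pg_restore reads, and `--no-owner` avoids errors when the role names differ between the old and new servers.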

I still have some concerns. Did we document the process correctly? It does not seem feasible to wipe everything out and reinstall it to make sure. Also, we still haven’t worked out the process that should be used for checking out code to make changes.

It’s been a productive summer and we’ve learned a lot, but I feel we are running out of time before completing our mission. Starting and stopping, summer to summer, without continuous focus can be detrimental to projects. This is not the first time I’ve encountered this as it seems to be prevalent in academic life.

So, in summary, I see two challenges to library/data science projects:

Bridging the gap between librarians and computer science knowledge
Maintaining the continuity of ongoing projects

Jamillah Gabriel: Python Functions for Merging and Visualizing

August 11, 2019 | August 11, 2019 | Sam Grabus

This past week, I’ve been working on a function in Python that merges the two different datasets (WRA and FAR) so as to simplify the process of querying the data.

The reason for merging the data was to find a simpler alternative to the previous search function developed by Densho, which used if/else statements and for loops to pull data from each dataset.

Now, one can search the data for a particular person and recover all of the available information about that person in a simple query. After the merge, the data output looks something like this when formulated as a list:

In addition to this, I’ve also played with some basic visualizations using Python to display some of the data in pie charts. I’m hoping to wrap up the last week working on more visualizations and functions for querying data.
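To show the general idea of the merge described above, here is a minimal Python sketch. It is not the function I wrote for the project: it assumes both datasets share an identifier column (called `person_id` here, a hypothetical name), and all field names and values are placeholders rather than real WRA or FAR records.

```python
from collections import defaultdict


def merge_records(wra_rows, far_rows, key="person_id"):
    """Combine rows from both datasets into one dict per person.

    Field names are prefixed with their source so nothing is overwritten.
    """
    merged = defaultdict(dict)
    for source, rows in (("wra", wra_rows), ("far", far_rows)):
        for row in rows:
            for field, value in row.items():
                if field != key:
                    merged[row[key]][f"{source}_{field}"] = value
    return dict(merged)


def search(merged, name):
    """Return every merged record whose WRA or FAR name contains the text."""
    needle = name.lower()
    return {
        pid: record for pid, record in merged.items()
        if any(needle in str(value).lower()
               for field, value in record.items() if field.endswith("_name"))
    }


# Placeholder rows; not real records from either dataset.
wra = [{"person_id": "p1", "name": "Example Person", "field_a": "value A"}]
far = [
    {"person_id": "p1", "name": "Example Person", "field_b": "value B"},
    {"person_id": "p2", "name": "Another Person", "field_b": "value C"},
]

merged = merge_records(wra, far)
print(search(merged, "example"))
```

With the data merged once up front, a single lookup returns everything known about a person, which replaces the branching needed to query each dataset separately.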

Category: LEADS Blog

Alyson Gamble: Two years of Work!

Final Post: Julaine Clunis Wrap Up

Final post: Kai Li: Wrapping up of OCLC project

Week – 9 Sonia Pascua – 1910 LCSH Database Schema

Rongqian Ma; Week 8-10

Rongqian Ma; Week 6-7: Exploring Timeline JS for the Stories of Book of Hours

Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format

Alyson Gamble, Week 5: Historical Society of Pennsylvania

Bridget Disney, California Digital Library – YAMZ

Jamillah Gabriel: Python Functions for Merging and Visualizing