LEADS Blog

New Data Science Tool

For the project, we are also interested in matching against the Getty Art & Architecture Thesaurus (AAT). We wrote a SPARQL query to pull subject terms from AAT and downloaded the results in JSON format.
With data in these different formats (JSON from AAT, CSV from our extracted subject terms), I needed a way to work with both and evaluate the data in one file against the other. In searching for answers I came across a data analytics tool that can be used for data preparation, blending, advanced and predictive analytics, and other data science tasks, and that can take inputs from various file formats and work with them directly.
A useful feature of the tool is the ability to build a workflow that can be exported and shared, which other members of a team can use as is or, if need be, turn into code.
I managed to join the JSON and CSV files, and after performing some transformations I was able to identify exact matches. The tool also has a fuzzy match function that I am still trying to figure out and work into an effective, reproducible workflow; I suspect that will take up quite a bit of my time.
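For anyone who prefers to see the idea in code rather than in the tool's workflow, a rough Python/pandas equivalent of the join and exact-match step might look like the sketch below. The file names and column names are placeholders, since they depend on how the AAT results and subject terms were exported.

```python
import json
import pandas as pd

# Load the SPARQL results saved from the Getty AAT.
# Assumes the JSON has been flattened to a list of records with
# "aat_label" and "aat_uri" keys (placeholder names).
with open("aat_terms.json") as f:
    aat = pd.json_normalize(json.load(f))

subjects = pd.read_csv("subject_terms.csv")   # assumed column: "subject_term"

# Light normalization before matching: lowercase and strip whitespace.
subjects["term_norm"] = subjects["subject_term"].str.lower().str.strip()
aat["label_norm"] = aat["aat_label"].str.lower().str.strip()

# An inner join keeps only the exact (normalized) matches.
exact_matches = subjects.merge(
    aat, left_on="term_norm", right_on="label_norm", how="inner"
)
print(f"{len(exact_matches)} exact matches found")
```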

Julaine Clunis
LEADS Blog

Clustering

One of the things we've noticed about the dataset is that, beyond exact duplicates, there are subject terms that refer to the same thing but are spelled or entered differently by the contributing institutions. We've been thinking about using clustering applications to look at the data and see what kinds of matches are returned.
It was necessary first to do some reading on what the different clustering methods do and how they might work for our data. We ended up trying clustering with key collision methods (fingerprint, n-gram fingerprint) as well as nearest-neighbor (kNN) methods using Levenshtein distance. They return different results, and we are still reviewing those results before performing any merges. It is possible for terms to look the same or seem similar but in fact be different, so it is not as simple as merging everything that matches.
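To give a sense of what the key collision approach does, here is a minimal Python sketch of a fingerprint-style key (modeled on OpenRefine's fingerprint method), with a rough standard-library similarity ratio standing in for the Levenshtein comparison. The example terms are made up for illustration.

```python
import re
import string
from collections import defaultdict
from difflib import SequenceMatcher

def fingerprint(term: str) -> str:
    """Fingerprint-style key: lowercase, strip punctuation,
    split into tokens, de-duplicate, sort, and rejoin."""
    term = term.lower().strip()
    term = term.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(re.split(r"\s+", term)))
    return " ".join(tokens)

def cluster_by_fingerprint(terms):
    clusters = defaultdict(set)
    for t in terms:
        clusters[fingerprint(t)].add(t)
    # Only keys that collide (more than one distinct spelling) are interesting.
    return {k: v for k, v in clusters.items() if len(v) > 1}

terms = ["United States. Congress", "Congress, United States", "U.S. Congress"]
print(cluster_by_fingerprint(terms))
# {'congress states united': {'United States. Congress', 'Congress, United States'}}

# A rough similarity score (a stand-in for a Levenshtein ratio) before any merge.
print(SequenceMatcher(None, "U.S. Congress", "United States. Congress").ratio())
```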
One important question is how accurate the clusters are and whether we can trust the results enough to merge automatically. My feeling is that a lot of human oversight is needed to evaluate the clusters.
Another thing we want to test is how much faster the reconciliation process would be if we accepted and merged the cluster results, and whether it is worth the time: if we cluster and then do string matching, is there an improvement in the results, or are they basically the same?

Julaine Clunis
LEADS Blog

Extracting Subjects

After my last post I spent some time, along with my mentors, figuring out how to isolate the subject headings and IDs from the dataset. Since the dataset was so large and my machine did not have the power to handle it all, we decided to do all our tests with a sample subset. Using some Python code with Apache Spark, we managed to isolate the subject terms from these records and output them as a CSV file. The sample yielded over 700,000 subject terms.
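A minimal PySpark sketch of this kind of extraction is below. The input path and the field names ("subjects", "id", "name") are assumptions for illustration, not the actual record structure.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("extract-subjects").getOrCreate()

# Field names below are assumptions about the sample's structure.
records = spark.read.json("sample_records.json")

subjects = (
    records
    .select(explode(col("subjects")).alias("subject"))   # one row per subject entry
    .select(col("subject.id").alias("subject_id"),
            col("subject.name").alias("subject_term"))
    .dropDuplicates()
)

# coalesce(1) writes a single CSV file for use in OpenRefine.
subjects.coalesce(1).write.option("header", True).csv("subject_terms_csv")
```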
One of the goals of this project is to map these terms against LCSH. At first my idea was to download the LCSH dataset in XML and see what kind of scripting I could do with it. However, I discovered a Python script that extends OpenRefine and performs reconciliation against the Library of Congress API, which we decided to test. It lets you load a CSV file and run the reconciliation script against it. We found this to be an effective method for finding close matches where the confidence level for a match is over 85%. The reconciliation process returns the term as listed in LCSH along with a URI, which can be saved with the original data. The biggest concern with this method is the time it takes to run within OpenRefine; however, my mentors feel that this process can be captured and run in a similar way outside the tool using other programming methods.
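As a rough illustration of what running this outside OpenRefine might look like, here is a minimal sketch that queries the id.loc.gov "suggest" service directly with Python's requests library. The endpoint and its response format are my assumptions about the LC service and are worth verifying; it also only keeps exact label matches rather than reproducing the reconciliation script's 85% confidence scoring, and the column name "subject_term" is a placeholder.

```python
import csv
import requests

SUGGEST_URL = "https://id.loc.gov/authorities/subjects/suggest/"

def reconcile(term):
    """Return the (label, uri) of the top LCSH suggestion, or None."""
    resp = requests.get(SUGGEST_URL, params={"q": term}, timeout=30)
    resp.raise_for_status()
    # Assumes an OpenSearch-style response: [query, [labels], [descriptions], [uris]]
    query, labels, _, uris = resp.json()
    if labels and labels[0].lower() == term.lower():
        return labels[0], uris[0]
    return None

with open("subject_terms.csv", newline="") as infile, \
     open("reconciled.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["subject_term", "lcsh_label", "lcsh_uri"])
    for row in reader:
        match = reconcile(row["subject_term"])
        if match:
            writer.writerow([row["subject_term"], match[0], match[1]])
```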
Later we manually checked the items that were returned to see if they were in fact matches, and happily everything checked out. There remains a question of whether there are subjects that are not close/exact matches but rather fuzzy matches, and how to identify those and get URI results for them. The dataset also seemed to have a number of duplicates and data that may need some cleaning and preparation, so that is another thing to examine.

Julaine Clunis
LEADS Blog

Rongqian Ma; Week 4-5: Visualizing Decorations Information

Decoration information is one of the most complex categories of information in the dataset, and visualizing it requires a good deal of data pre-processing. The dataset contains two layers of information: a) what decorations each manuscript includes, and b) how those decorations are arranged across the manuscript. Conveying both layers can communicate the decorative characteristics of the books of hours. For the "what" part, I identified several major decorative elements from the dataset and color-coded each element in the Excel sheet, such as the illuminated initial, miniature (large and small), foliate, border (border decorations), bookplate (usually indicating the ownership of the book), catalog, notation, and multiple pictorial themes and imageries (e.g., Annunciation, Crucifixion, Pentecost, Betrayal, Lamentation, Mary, Christ).

Figure 1 shows my preliminary attempt to visualize this decorative information. I coded the major decorative patterns on the left half of the coding graph and the major pictorial themes (e.g., Virgin, Christ, Annunciation) on the right half. From this preliminary coding graph, there appear to be two general decorative styles for the books of hours: one focuses on making the manuscripts beautiful, while the other focuses on displaying stories and the meaning behind them through pictorial representations of the texts. I then went back to the original digitized images of the manuscript collection and found that the decorative patterns were mostly used to decorate the texts (appearing around them), while the other style appears mostly as full-leaf miniatures supplementing the texts. A preliminary analysis of the two styles' relationship with the geographic information also suggests that the first style is mostly associated with France, while the style that emphasizes miniature storytelling is more associated with production locations such as Bruges.

For the second step, I explored the transitions and relationships among different decorative elements using Tableau, Voyant, and Wordle. Figure 2 is a word cloud that shows the frequency of the major decoration elements across the whole manuscript collection. Voyant Tools, in comparison, provides a way to further demonstrate the strength of the relationships among decorative elements across the dataset. Treating all the decoration information as text, the "links" feature in Voyant shows the relationships among elements. For instance, the link between "illuminated" and "initial" is the strongest, and there are also associations among other decoration elements such as "decorated," "line," "miniature," "border," "bookplate," and "vignette." The dataset also confirms that patterns such as illuminated initials, miniatures, and bookplates indicating the ownership of the book are the most common elements. The links, however, do not capture any of the relationships among the pictorial themes.
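Figure 2 itself was made with Wordle, but the same frequency view can be reproduced in code. Below is a minimal Python sketch that assumes the cleaned decoration terms sit in a single (hypothetical) "decoration" column of the Excel sheet and uses the pandas and wordcloud packages.

```python
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd

# The file and column names are placeholders for the cleaned Excel sheet.
df = pd.read_excel("book_of_hours_decorations.xlsx")
terms = df["decoration"].dropna().str.lower().str.split(r"[;,]\s*").explode()

# Count how often each decoration element appears across the collection.
freqs = Counter(terms)
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(freqs)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```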

Figure 1. Preliminary color-coded graph of decorative patterns (left half) and pictorial themes (right half).

Figure 2. Word cloud of the major decoration elements across the manuscript collection.
 
Figure 3. Voyant analysis of the decoration information.
LEADS Blog

Jamillah Gabriel: Working Around the Unique Identifier

In recent weeks, my project has taken an unexpected turn from data storytelling and visualization towards data processing. As it turns out, our partner organization (Densho.org) has already done some data cleaning in OpenRefine, created a database, and begun preliminary data processing. I'll be using Python and Jupyter Notebook to continue the work they've started, first by testing previous processes and then by creating new ones. I also found out that the data doesn't have unique identifiers, so I'll be using the following workaround to attempt to isolate pockets of data.

[Screenshot of the partial code for the composite-field query]

In this partial example (there's more to it than what is shown in the screenshot), I'll need to query the data with a for loop that searches on a combination of first name, last name, family number, and year of birth in order to locate records precisely, in a way that approximates the use of a unique identifier. Not having a unique identifier makes it much more difficult to access data quickly and accurately, but hopefully this for loop will do the trick. I'm looking forward to playing with the code more and seeing what can be discovered.
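Since the screenshot only shows part of the code, here is a hypothetical sketch of the idea: loop over the records and match on all four fields at once, so that the combination acts like a unique identifier. The file and column names are placeholders, not the actual Densho fields.

```python
import pandas as pd

# Column names here are assumptions; the actual tables may differ.
records = pd.read_csv("densho_records.csv")

def find_person(records, first_name, last_name, family_number, birth_year):
    """Use a composite of four fields as a stand-in for a unique identifier."""
    matches = []
    for _, row in records.iterrows():
        if (row["first_name"] == first_name
                and row["last_name"] == last_name
                and row["family_number"] == family_number
                and row["birth_year"] == birth_year):
            matches.append(row)
    return pd.DataFrame(matches)

# Hypothetical query values for illustration.
result = find_person(records, "Jane", "Doe", 12345, 1920)
print(result)
```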

LEADS Blog

Alyson Gamble, Week 4: Historical Society of Pennsylvania

I just realized that the titles of my posts aren't in a standardized format, so I won't complain too much about nonstandard entries in data sets, at least not this week. 😉
In the last few days, I've made some progress on hand-analyzing the schools data (which seems done) and the occupational data (not done). What is definitely done is my use of R for this project: I'm officially no longer trying to "fix" things with R. R has a lot of great capabilities, and I usually like working in it, but it wasn't serving me for this particular project. Sometimes the best thing to do is to admit that something isn't working and move on to a different system.
Thanks to my mentor, Caroline, I’ve collected more useful resources to help with the next steps of the project. I’m also following advice from Caroline and several LEADS people, including other 2019 fellows, and looking at OpenRefine as my antidote for the street data. As Bridget pointed out in a comment on the last post, the way we talk about addresses isn’t standard: six miles outside of town doesn’t exactly correspond to an exact set of coordinates.
My goals to accomplish by next Friday are to: (1) create a slide and a one-minute recording on my work thus far for the advisory board meeting; (2) find some genuinely fun things to add to that slide, such as interesting ways of referring to geographic locations; and (3) be genuinely "done" with the school and occupational data. Then I can start working in OpenRefine.
As a final bit, Monica at the Historical Society of Pennsylvania wrote up a nice post about the work being done on this project, which you can read here: https://hsp.org/blogs/fondly-pennsylvania/enhancing-access-through-data-visualization

Alyson Gamble
Doctoral Student, Simmons University
LEADS Blog

Week 2: Kai Li: What I talk about when I talk about publishers

As I mentioned in my previous posts, the entitization of publishers was only recently problematized when the new BIBFRAME model was proposed; it treats the publisher as a separate entity in the overall bibliographic universe rather than as a text string in the MARC record. From the perspective of cataloging, however, we still do not seem to know very much about what a publisher is.

In the MARC record, two fields, 260 and 264, are used to describe information about the publication, printing, distribution, issue, release, or production of a resource. The use of these two fields differs between the two cataloging rule sets, AACR2 (Anglo-American Cataloguing Rules, 2nd edition) and RDA (Resource Description and Access), which replaced AACR2. Under AACR2, all publisher and distributor information is described in 260 subfield $b, and multiple subfields can be used when there is more than one publisher or distributor. Under RDA, the 264 field is used instead, and the different functions (primarily publication and distribution in the earlier context) are distinguished by the field's second indicator. One issue with the AACR2 rules is that they do not require publisher names to be transcribed exactly as they appear on the resource: catalogers have the freedom to omit or abbreviate some name components, such as "publishers" and "limited." In certain ways, the RDA rules are more consistent with how publishers are supposed to be handled in a more modern information infrastructure: publishers should be recorded in a consistent, structured manner and not mixed with other types of entities (especially distributors, but also printers and issuers). In practice, however, the majority of library bibliographic records were produced under the AACR2 rules, and they are almost impossible to transform into RDA-style records because we do not know which name components were omitted or abbreviated.
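To make the field structure concrete, here is a small sketch using the pymarc library to pull the transcribed publisher strings out of 260/264 subfield $b. The file name is a placeholder; the point is simply that what comes out is a transcribed string, not a controlled entity.

```python
from pymarc import MARCReader

# On 264, the second indicator distinguishes the function:
# 0 = production, 1 = publication, 2 = distribution, 3 = manufacture, 4 = copyright.
ROLE_BY_INDICATOR = {"0": "production", "1": "publication",
                     "2": "distribution", "3": "manufacture", "4": "copyright"}

with open("records.mrc", "rb") as fh:            # placeholder file name
    for record in MARCReader(fh):
        for field in record.get_fields("260", "264"):
            if field.tag == "264":
                role = ROLE_BY_INDICATOR.get(field.indicators[1], "other")
            else:
                role = "publication/distribution (AACR2, undifferentiated)"
            for publisher in field.get_subfields("b"):
                # Transcribed strings, not entities: "Wiley," and
                # "John Wiley & Sons, Inc." come out as different values.
                print(field.tag, role, publisher.strip(" ,.:;"))
```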

The inconsistent description of publisher names in the MARC format is only one barrier to identifying publishers, and a relatively easy one to solve; the real challenge in the present project is the history of the publishing industry. In the real world, what is described in 260/264 subfield $b is just an imprint, which, by definition, is whatever unit publishes the resource (it could be a publisher, a brand or branch owned by a publisher, or an individual person who publishes the resource). For example, at this link you can see all the imprints owned by Penguin Random House, which, incidentally, was formed in 2013 from the merger of Penguin Group and Random House, two of the largest publishers in the American publishing market.

Throughout the history of the publishing industry, publishers have been merging and splitting, as in the example of Penguin Random House. A publisher might acquire another publisher outright, or only some of the brands (imprints) owned by another publisher. In some rare cases, an imprint was sold to a different publisher and later sold back to its original owner. Shown below is a slice of data manually collected by Cecilia about the history of Wiley, another major American publisher.


[A slice of the history of Wiley]

From this imprint-centered view, a publisher is a higher-level entity than an imprint, one that includes all of its child entities at a given time. In other words, quite unlike other bibliographic concepts such as works ("great works are timeless"), publishers and imprints exist in a temporal framework. This is a huge challenge for the project, partly because temporality is extremely difficult to combine with network analysis methods. While I cannot offer a solution to this difficulty yet, it will be an interesting topic to address further in my work.
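One way to make that temporality concrete is to attach a validity interval to each imprint-to-owner relationship. The sketch below is purely illustrative; the imprint names, publishers, and years are invented, not the Wiley data shown above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ownership:
    """An imprint's parent publisher during a time interval."""
    imprint: str
    parent: str
    start: int
    end: Optional[int] = None   # None = still owned

# Hypothetical values for illustration only.
history = [
    Ownership("Imprint A", "Publisher X", 1985, 2001),
    Ownership("Imprint A", "Publisher Y", 2001, 2010),
    Ownership("Imprint A", "Publisher X", 2010),   # sold back to the original owner
]

def owner_in(imprint: str, year: int):
    """Resolve an imprint to its parent publisher for a given year."""
    for o in history:
        if o.imprint == imprint and o.start <= year and (o.end is None or year < o.end):
            return o.parent
    return None

print(owner_in("Imprint A", 2005))   # Publisher Y
```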

LEADS Blog

Week 4-5: Implementing the experiments & blockers

Over the past two weeks I have been implementing the experiments we proposed: pairwise alignments of the 'historical sovereignty' of Taiwan.
Beyond the Darwin Core-based occurrence dataset, we believe that adding an extra field called 'historical sovereignty' would be very beneficial for scientists studying the historical distribution of certain species. In the case of Pupinella swinhoei, a land snail, we found most of the occurrences to be located in Taiwan. As the last blog post mentioned, the years in which this species occurs span a broad range: from 1700 to the present.
 
However, I ran into some blockers when looking through the actual dataset:
1. Country code: the dataset sometimes gives the country code as TW (Taiwan) and sometimes as JP (Japan); did the contributors really mean that the species occurred in those locations? When we cross-referenced the 'countryCode' field with the 'locality' field, there were also discrepancies, such as a country code of Japan with a locality of Formosa (an alias for Taiwan). What is stranger is that the year on some of these records is 1700, when Taiwan was not part of Japan. The country code, locality, and year fields are all problematic in this sense.
2. Year: we have 50 records in total on Pupinella swinhoei. Almost all of the records have country codes, but more than two-thirds are missing the year. Knowing the year the species was observed or collected is crucial, since it is one factor in how we determine the historical sovereignty of Taiwan at the time of the record. (A quick pandas sketch of both checks follows this list.)
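Here is the kind of quick consistency check referred to above, written with pandas against a Darwin Core occurrence export. The file name is a placeholder; the column names follow the Darwin Core terms countryCode, locality, and year.

```python
import pandas as pd

# Darwin Core occurrence export (placeholder file name).
occ = pd.read_csv("pupinella_swinhoei_occurrences.csv")

# Blocker 1: country code vs. locality discrepancies (e.g., JP + "Formosa").
formosa = occ["locality"].str.contains("Formosa|Taiwan", case=False, na=False)
suspect = occ[(occ["countryCode"] == "JP") & formosa]
print(f"{len(suspect)} records coded JP but with a Taiwanese locality")

# Blocker 2: how many records are missing the year entirely?
missing_year = occ["year"].isna().sum()
print(f"{missing_year} of {len(occ)} records have no year")
```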
I suppose we could come from the other direction and establish Taiwan's historical sovereignty from Taiwan's own timeline, but if we disregard the occurrence data's years and rely solely on outside information, our original goal of proposing a 'more precise' way of merging taxonomically organized datasets would be lost. We also could no longer view this as constructing a data-driven knowledge graph (our endgame).
 
Another workaround is to add dummy records alongside the real records and fill in the years that we want to examine.
 
More to be discussed. Until next week!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica

 

LEADS Blog

California Digital Library – YAMZ
Bridget Disney
We are making slow and steady progress on YAMZ (pronounced "yams"). My task this week has been to import data into my local instance. I began by trying to import the data manually into PostgreSQL but got stuck, even after trying a few different methods I found through Google.
This is where the advice of someone experienced comes in handy. In our Zoom meeting last week with John (mentor), Dillon (previous intern), and Hanlin, it became evident that I should have been using the import function available in YAMZ. Finally, progress could be made. I hammered out some fixes that allowed the data to be imported, but it wasn't elegant. Another meeting with John shed light on the correct way to do it.
YAMZ uses four PostgreSQL tables: users, terms, comments, and tracking. We had errors during the import because the 'terms' data references a foreign key from the 'users' table, so the 'users' table must be imported first. There were still other errors, and we only ended up importing 43 records into the 'terms' table when there should have been about 2,700! John will be providing us with another set of exported JSON files; the first one only had 252 records. He also provided us with some nifty Unix tricks for finding and replacing data.
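The fix we used was YAMZ's own import function, but the dependency issue itself is easy to illustrate: load the tables in an order that respects the foreign keys. The sketch below is not the YAMZ code; it assumes each exported JSON file is simply a list of row objects whose keys match the table's columns.

```python
import json
import psycopg2
from psycopg2.extras import execute_values

# Import in dependency order: 'terms' has a foreign key to 'users',
# so 'users' must be loaded before 'terms'.
TABLE_ORDER = ["users", "terms", "comments", "tracking"]

conn = psycopg2.connect(dbname="yamz", user="postgres")   # connection details assumed

with conn, conn.cursor() as cur:
    for table in TABLE_ORDER:
        with open(f"{table}.json") as f:          # placeholder export file names
            rows = json.load(f)                   # assumed: list of dicts, one per row
        if not rows:
            continue
        columns = list(rows[0].keys())
        values = [[row[c] for c in columns] for row in rows]
        execute_values(
            cur,
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s",
            values,
        )
```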
On the server side, both Hanlin and I have been able to access the production site on AWS. We are going to try to figure out how to get that running this week.
 
LEADS Blog

Alyson Gamble: Week 03

This week on my project at the Historical Society of Pennsylvania, I focused on exploring the data and planning what I want to accomplish by the end of this fellowship. While exploring the data, as well as the project files from last year's fellow, I noticed a few very important issues:
  1. Address data needs to be addressed using available resources for dealing with old street names, as well as standardization of address formats.
  2. School data needs to be adjusted for duplication and renaming.
  3. Occupation data needs to be considered: can the non-standard occupations be mapped to controlled vocabularies? If not, how can this information be utilized?
These three main issues appear to be the best focus for my time during the next two weeks. To keep my mind active during this process, I’ll try to collect unusual examples, which I’ll share here.

Alyson Gamble
Doctoral Student, Simmons University