LEADS Blog

Jamillah Gabriel: Working Around the Unique Identifier

In recent weeks, my project has taken an unexpected turn from data storytelling and visualization towards data processing. As it turns out, our partner organization (Densho.org) has already done some data cleaning in OpenRefine, created a database, and begun preliminary data processing. I'll be using Python and Jupyter Notebook to continue the work they've started, first by testing previous processes and then by creating new ones. I also found out that the data doesn't have unique identifiers, so I'll be using the following workaround to isolate pockets of data.

 

[Screenshot: a partial example of the workaround code]

 

In this partial example (there's more to it than what's seen in this screenshot), I'll need to query the data using a for loop that searches for a combination of first name, last name, family number, and year of birth in order to locate records precisely, in a way that approximates the use of a unique identifier. I'm finding that not having a unique identifier makes it much more difficult to access data quickly and accurately, but hopefully this for loop will do the trick. I'm looking forward to playing with the code more and seeing what can be discovered.
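To make the idea concrete, here is a minimal sketch of what such a composite-key lookup might look like in Python. The column names and file name are my own placeholders, not the actual fields in the Densho data, and the real notebook code differs from this.

```python
# Minimal sketch: match on a combination of fields as a stand-in for a unique
# identifier. Column names ("first_name", "last_name", "family_number",
# "birth_year") and the file name are assumptions for illustration only.
import csv

def find_person(path, first_name, last_name, family_number, birth_year):
    """Return every row whose composite 'identifier' matches the query."""
    matches = []
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if (row.get("first_name") == first_name
                    and row.get("last_name") == last_name
                    and row.get("family_number") == family_number
                    and row.get("birth_year") == birth_year):
                matches.append(row)
    return matches

# Hypothetical usage:
# find_person("exit_records.csv", "Taro", "Yamada", "12345", "1898")
```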

LEADS Blog

Alyson Gamble, Week 4: Historical Society of Pennsylvania

I just realized that the titles of my posts aren’t in a standardized format, so I won’t complain too much about nonstandard entries in data sets–at least not this week. 😉
In the last few days, I've made some progress on hand-analyzing the school data (which seems done) and the occupational data (not done). One thing that is done, though, is my use of R for this project: I'm officially no longer trying to "fix" things in R. R has a lot of great capabilities, and I usually like working in it, but it wasn't serving me on this particular project. Sometimes the best thing to do is to admit that something isn't working and move on to a different system.
Thanks to my mentor, Caroline, I’ve collected more useful resources to help with the next steps of the project. I’m also following advice from Caroline and several LEADS people, including other 2019 fellows, and looking at OpenRefine as my antidote for the street data. As Bridget pointed out in a comment on the last post, the way we talk about addresses isn’t standard: six miles outside of town doesn’t exactly correspond to an exact set of coordinates.
My goals to accomplish by next Friday are to (1) create a slide and a one-minute recording for the advisory board meeting on my work thus far; (2) find some genuinely fun things to add to that slide, such as interesting ways to refer to geographic locations; and (3) be genuinely "done" with the school and occupational data. Then I can start working in OpenRefine.
As a final bit, Monica at the Historical Society of Pennsylvania wrote up a nice post about the work being done on this project, which you can read here: https://hsp.org/blogs/fondly-pennsylvania/enhancing-access-through-data-visualization

Alyson Gamble
Doctoral Student, Simmons University
LEADS Blog

Week 2: Kai Li: What I talk about when I talk about publishers

As I mentioned in my previous posts, the entitization of publishers was only recently problematized when the new BibFrame model was proposed, which treats the publisher as a separate entity in the overall bibliographic universe rather than as a text string in the MARC record. From the perspective of cataloging, however, we still do not seem to know very much about what a publisher is.

In the MARC record, two fields, 260 and 264, are used to describe information about the publication, printing, distribution, issue, release, or production of a resource. The use of these two fields differs between the two sets of cataloging rules, AACR2 (the Anglo-American Cataloguing Rules, 2nd edition) and RDA (Resource Description and Access), which replaces AACR2. Under AACR2, all publisher and distributor information is described in field 260, subfield b, and multiple subfields can be used when there is more than one publisher or distributor. Under RDA, however, field 264 is used, and the different functions (primarily publication and distribution, in the previous context) are distinguished by the field's second indicator. One issue with the AACR2 rules is that they do not require publisher names to be transcribed exactly as they are displayed on the resource: catalogers have the freedom to omit or abbreviate some name components, such as "publishers" and "limited." In certain ways, the RDA rules are more consistent with how publishers are supposed to be handled in a more modern information infrastructure: publishers should be recorded in a more consistent and structured manner and not mixed with other types of entities (especially distributors, but also printers and issuers). In practice, however, the majority of library bibliographic records were produced under the AACR2 rules, and these are almost impossible to transform into RDA-style records because we do not know which name components were omitted or abbreviated.
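As a concrete illustration (not part of my actual workflow), here is a minimal sketch of pulling publisher-name strings out of 260/264 with pymarc; the file name is a placeholder, and real catalog data has many edge cases this ignores.

```python
# Minimal sketch: extract publisher/distributor name strings from MARC 260 and
# 264 fields using pymarc. "records.mrc" is a placeholder file name.
from pymarc import MARCReader

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        # AACR2-era records: publisher (and distributor) names in 260 $b
        for field in record.get_fields("260"):
            for name in field.get_subfields("b"):
                print("260 $b:", name.strip(" ,:;"))
        # RDA-era records: 264 with second indicator 1 = publication statement
        for field in record.get_fields("264"):
            if field.indicators[1] == "1":
                for name in field.get_subfields("b"):
                    print("264 $b (publication):", name.strip(" ,:;"))
```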

While the (inconsistent) way publisher names are described in the MARC format is just one barrier to the identification of publishers, and a relatively easy one to solve, a real challenge in the present project is the history of the publishing industry itself. In the real-world context, what is described in 260/264 subfield b is just an imprint, which, by definition, is the unit that publishes, whatever that unit is (it could be a publisher, a brand or branch owned by a publisher, or an individual person who publishes the resource). For example, in this link you can see all the imprints owned by Penguin Random House, which, incidentally, was formed in 2013 from the merger of Penguin Group and Random House, two of the largest publishers in the American publishing market.

Throughout the history of the publishing industry, publishers have been merging and splitting, just as in the example of Penguin Random House. A publisher might acquire another publisher outright, or just some of the brands (imprints) owned by another publisher. And in some rare cases, an imprint was sold to a different publisher and later sold back to its original owner. Shown below is a slice of the data manually collected by Cecilia about the history of Wiley, another major American publisher.


[A slice of the history of Wiley]

From this imprint-centered view, a publisher is a higher-level entity than an imprint, one that includes all of its child entities at a given time. In other words, quite unlike other bibliographic concepts, such as works ("great works are timeless"), publishers and imprints exist in a temporal framework. This is a huge challenge for the project, partly because temporality is extremely difficult to combine with network analysis methods. While I cannot offer a solution to this difficulty at the moment, it will be an interesting topic to address further in my future work.
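One possible direction, sketched below purely as an illustration (the ownership edges, imprint names, and dates are invented, and this is not the project's actual data model), is to attach validity intervals to publisher-imprint edges and take yearly snapshots of the network.

```python
# Minimal sketch: ownership edges carry a validity interval, and a snapshot of
# the network can be taken for any given year. All example data is invented.
import networkx as nx

ownership = nx.DiGraph()
edges = [
    # (publisher, imprint, start_year, end_year) - hypothetical values
    ("Penguin Group", "Imprint A", 1975, 2013),
    ("Penguin Random House", "Imprint A", 2013, 2020),
]
for publisher, imprint, start, end in edges:
    ownership.add_edge(publisher, imprint, start=start, end=end)

def snapshot(graph, year):
    """Return the ownership network as it existed in the given year."""
    g = nx.DiGraph()
    for u, v, data in graph.edges(data=True):
        if data["start"] <= year < data["end"]:
            g.add_edge(u, v)
    return g

print(list(snapshot(ownership, 2000).edges()))  # [('Penguin Group', 'Imprint A')]
```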

LEADS Blog

Week 4-5: Implementing the experiments & blockers

Over these two weeks I have been implementing the experiments we proposed: pairwise alignments of the 'historical sovereignty' of Taiwan.
Apart from the Darwin Core-based occurrence dataset, we believe that adding an extra field called 'historical sovereignty' will be very beneficial for scientists studying the historical distribution of certain species. For the case of Pupinella swinhoei, a land snail, we found most of the occurrences to be located in Taiwan. As the last blog post said, the years in which this species occurs span a broad range: from 1700 to now.
 
However, some blockers I ran into when looking through the actual dataset are the following (see the sketch after this list):
1. Country code: the dataset sometimes gives the country code as TW (Taiwan) and sometimes as JP (Japan); did the recorders really mean that the species occurred in those locations? When we cross-referenced the 'country code' field with the 'locality' field, there were also discrepancies, such as the country code being Japan while the locality is Formosa (an alias of Taiwan). What is stranger is that the year on some of these records is 1700, and at that time Taiwan was not part of Japan. The country code, locality, and year fields are all problematic in this sense.
2. Year: We have 50 records in total on Pupinella swinhoei. Almost all of the records have country codes, but more than two-thirds are missing the year information. Knowing the year in which the species appeared or was collected is crucial, since this is one factor in how we determine the historical sovereignty of Taiwan.
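As a rough sketch of what checking these discrepancies might look like (the Darwin Core column names 'countryCode', 'locality', and 'year' are assumptions about the export, and the file name is a placeholder):

```python
# Minimal sketch: flag occurrence records whose country code and locality
# disagree, and count records missing a year. Column and file names assumed.
import pandas as pd

occurrences = pd.read_csv("pupinella_swinhoei_occurrences.csv")

TAIWAN_ALIASES = ("taiwan", "formosa")

def locality_suggests_taiwan(locality):
    text = str(locality).lower()
    return any(alias in text for alias in TAIWAN_ALIASES)

suspect = occurrences[
    (occurrences["countryCode"] != "TW")
    & occurrences["locality"].apply(locality_suggests_taiwan)
]
print(suspect[["countryCode", "locality", "year"]])

missing_year = occurrences["year"].isna()
print(f"{missing_year.sum()} of {len(occurrences)} records are missing a year")
```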
I suppose we could go in another direction and look at Taiwan's historical sovereignty based on Taiwan's own timeline, but if we disregard the occurrence data's years and operate solely on outside information, our original goal of proposing a 'more precise' way of merging taxonomically organized datasets would be lost. We also probably could not view this as constructing a data-driven knowledge graph (our endgame).
 
Another workaround is to create dummy records in addition to the real records and fill in the years that we want to examine.
 
More to be discussed. Until next week!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica

 

LEADS Blog

California Digital Library

California Digital Library – YAMZ
Bridget Disney
We are making slow and steady progress on YAMZ (pronounced yams). My task this week has been to import data into my local instance. I began by trying to import the data manually into PostgreSQL but got stuck even though I tried a few different methods I had found using Google.
This is where the advice of someone experienced comes in helpful. In our Zoom meeting last week with John (mentor), Dillon (previous intern), and Hanlin, it became evident that I should have been using the import function already available in YAMZ. Finally, progress could be made. I hammered out some fixes that allowed the data to be imported, but it wasn't elegant. Another meeting with John shed light on the correct way to do it.
YAMZ uses four PostgreSQL tables: users, terms, comments, and tracking. We had errors during the import because the 'terms' data references a foreign key from the 'users' table, so the 'users' table must be imported first. There were still other errors, and we only ended up importing 43 records into the 'terms' table when there should have been about 2,700! John will be providing us with another set of exported JSON files; the first one only had 252 records. He also showed us some nifty Unix tricks for finding and replacing data.
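For readers curious what a dependency-ordered import might look like, here is a minimal sketch; the table columns, JSON layout, and connection details are my assumptions, not YAMZ's actual schema or import function.

```python
# Minimal sketch: load exported JSON into PostgreSQL in foreign-key order.
# Table columns and connection details are assumed, not YAMZ's real schema.
import json
import psycopg2

conn = psycopg2.connect(dbname="yamz", user="postgres")
cur = conn.cursor()

def load(table, path, columns):
    with open(path) as fh:
        rows = json.load(fh)
    placeholders = ", ".join(["%s"] * len(columns))
    for row in rows:
        cur.execute(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            [row.get(col) for col in columns],
        )

# Order matters: users first, then the tables whose rows reference them.
load("users", "users.json", ["id", "email", "name"])
load("terms", "terms.json", ["id", "owner_id", "term_string", "definition"])
conn.commit()
```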
 

 

On the server side, both Hanlin and I have been able to access the production site on AWS. We're going to try to figure out how to get that running this week.
 
LEADS Blog

Week 5: Sonia Pascua – Project progress report

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

I. Project update
  • Digitized 1910 LCSH was converted to Docx format by Peter
  • I was able to run the HIVE code on my local computer for code exploration
  • A sample db in HIVE is composed of 3 tables; below is the LCSH db in HIVE
  • I was able to create the 1910 LCSH thesaurus for letter A on page 1 using MultiTes
  • I generated the HTML of the 1910 LCSH MultiTes thesaurus

  • I also generated the RDF/XML format of the thesaurus
  • I am looking at possible solutions for the project:
    • How will the Docx format of the 1910 LCSH be converted to RDF automatically?
    • How will the Docx format of the 1910 LCSH be loaded into the HIVE DB automatically?
II. Concerns / Issues / Risks
  • Which solution to take given the limited time
  • SKOS in HIVE has only a limited set of elements from the standard SKOS
III. Pending action item
  • To explore MultiTes for automating the conversion of the 1910 LCSH Docx to RDF
  • To explore other tools for automating the conversion of the 1910 LCSH Docx to RDF (a minimal rdflib-based sketch appears below)
  • To explore the HIVE code for automating the loading of the 1910 LCSH Docx into the HIVE db
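As a starting point for the RDF conversion question above, here is a minimal sketch using rdflib; the namespace, the parsed-heading structure, and the example entries are all placeholders, and the Docx parsing step is not shown.

```python
# Minimal sketch: turn parsed 1910 LCSH headings into SKOS concepts with
# rdflib. Namespace and example headings are placeholders; docx parsing
# (e.g. with python-docx) is assumed to happen elsewhere.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

LCSH1910 = Namespace("http://example.org/lcsh1910/")

headings = [
    # (preferred label, related labels) - hypothetical parsed entries
    ("Abbeys", ["Monasteries"]),
    ("Abbreviations", []),
]

g = Graph()
g.bind("skos", SKOS)
for label, related in headings:
    concept = LCSH1910[label.replace(" ", "_")]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))
    for rel in related:
        g.add((concept, SKOS.related, LCSH1910[rel.replace(" ", "_")]))

print(g.serialize(format="xml"))  # RDF/XML, comparable to the MultiTes export
```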
LEADS Blog

Alyson Gamble: Week 03

This week on my project at the Historical Society of Pennsylvania, I focused on exploring the data and planning for what I want to accomplish by the end of this fellowship. While exploring the data, as well as the project files from last year’s fellow, a few very important issues became apparent.
  1. Address data needs to be addressed using available resources for dealing with old street names, as well as standardization of address formats (a small sketch of one possible first pass appears after this list)
  2. School data needs to be adjusted for duplication and re-naming
  3. Occupation data needs to be considered: can the non-standard occupations be mapped to controlled vocabularies? If not, how can this information be utilized?
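As a very rough illustration of what a first pass at the address standardization might look like (the abbreviation list is my own invention, not the project's actual rules or the historical street-name resources):

```python
# Minimal sketch: expand common street-type abbreviations before matching
# against resources for historical street names. Patterns are illustrative.
import re

ABBREVIATIONS = {r"\bst\b\.?": "street", r"\bave\b\.?": "avenue", r"\brd\b\.?": "road"}

def normalize_address(raw):
    addr = raw.strip().lower()
    for pattern, replacement in ABBREVIATIONS.items():
        addr = re.sub(pattern, replacement, addr)
    return re.sub(r"\s+", " ", addr)

print(normalize_address("123 N. Broad St."))  # -> "123 n. broad street"
```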
These three main issues appear to be the best focus for my time during the next two weeks. To keep my mind active during this process, I’ll try to collect unusual examples, which I’ll share here.

Alyson Gamble
Doctoral Student, Simmons University
[Photo: Sam Grabus exploring the canals in Utrecht]
News & Events

MRC’s Sam Grabus presents at Digital Humanities 2019, in Utrecht

MRC PhD student Sam Grabus and Temple University's Peter Logan presented their paper at Digital Humanities 2019 in Utrecht, the Netherlands, on Thursday, July 11th.

Sam Grabus presenting at DH 2019, in Utrecht, demonstrating how the HIVE tool maps naturally-extracted keywords to controlled vocabulary terms.

The presentation, entitled “Knowledge Representation: Old, New, and Automated Indexing,” shared comparative topic relevance results from automatically indexing 19th-century Encyclopedia Britannica entries with two controlled vocabularies: a historical knowledge organization system developed by Ephraim Chambers and the contemporary Library of Congress Subject Headings.

LEADS Blog

Jamillah Gabriel: Moments in Time

The Tule Lake exit phone book (FAR) data represents the majority of the information available about the many Japanese American citizens who passed through the internment camp system. In most cases, this data, in conjunction with the limited data from the entry file, represents all of the information that is available. While there is not much here that allows us to paint a complete picture of their lives, we are at least able to conceive of some select moments in time, which I attempt to do in the case of Mrs. Kashi.


Above: FAR exit file

 


Above: Exit record for Mrs. Mitsuye Kashi

Mrs. Mitsuye Kashi was born on April 24, 1898 in the southern division of Honshu, Japan. Eighteen years later, she would arrive in the US, and later become an American citizen, marry Jutaro Kashi, and have a son, Tomio. Before internment, she and her family lived in Sacramento, California. But in 1942, the family was moved to a local assembly center located on Walerga Road, and soon after, assigned to the Sacramento internment camp. At some point, the family was transferred to the Tule Lake internment camp, where sadly, Mrs. Kashi would spend her last days. On June 4, 1943, less than a year after arriving at Tule Lake, Mrs. Kashi committed suicide. She was 45 years old. We have no record of why she committed suicide, but one can assume that life in the internment camp was unbearable for her.

In September, just three months later, Mrs. Kashi’s son was sent away to Central Utah Project, or the Topaz camp, which was a segregation center for dissidents. There are no records of what happened to Tomio afterwards. Her husband was released on June 28, 1944 and upon final departure from the camp, became a resident of Santa Fe, New Mexico.

LEADS Blog

Rongqian Ma: Week 3 – Visualizing the Date Information

During week 3, I focused on working with the date information in the manuscripts data. As with the geographical data, working with date information also means working with variants. The date information in the manuscript data is presented as descriptive text (e.g., "early 15th century"), and the ways of describing dates vary across the collection. Most of the time the dates appear as ranges (e.g., 1425-1450), and there is a lot of overlap between the ranges. Ambiguity exists across the dataset, mostly because the date information was collected and pieced together from the texts of the manuscripts. Additionally, some manuscripts appear to have been produced and refined during different time periods, for example with texts created during the 14th and 15th centuries and illustrations or decorations added at a later time. So the first task was to regroup the date information and make it clearer for visualization. The graph below shows how I color-coded the dataset and grouped the data into five general categories: before the 15th century, 1400-1450 (first half of the 15th century), 1450-1500 (second half of the 15th century), the 16th century, and cross-temporal/multiple periods.
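For illustration, here is a minimal sketch of how this kind of regrouping might be scripted; the column name and parsing rules are simplified assumptions, the real descriptions need more careful handling, and purely descriptive entries still require manual review.

```python
# Minimal sketch: bucket free-text manuscript dates into five broad period
# categories. Column name "date_text" and the parsing rules are assumptions.
import re
import pandas as pd

def categorize(date_text):
    """Map a descriptive date string to one of five broad period categories."""
    years = [int(y) for y in re.findall(r"1[0-9]{3}", str(date_text))]
    if not years:
        # Purely descriptive entries ("early 15th century") need manual review.
        return "cross-temporal/multiple periods"
    start, end = min(years), max(years)
    if end < 1400:
        return "before the 15th century"
    if start >= 1500:
        return "16th century"
    if start >= 1400 and end <= 1450:
        return "1400-1450"
    if start >= 1450 and end <= 1500:
        return "1450-1500"
    return "cross-temporal/multiple periods"

manuscripts = pd.DataFrame({"date_text": ["1425-1450", "1480-1510", "1390-1410"]})
manuscripts["period"] = manuscripts["date_text"].apply(categorize)
print(manuscripts)
```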

[Figure: color-coded groupings of the date information]

Based on these groupings, I created multiple line graphs, histograms, and bar charts to visualize the temporal distribution of book of hours production from different angles. The still visualizations helped me find some interesting insights: for example, the production of the book of hours increased from the 1450s onward, roughly the same period as the invention of the printing press.
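A small sketch of the kind of bar chart involved (the counts here are invented placeholders, not the actual collection numbers):

```python
# Minimal sketch: bar chart of manuscripts per period category. The counts are
# invented placeholders, not the real dataset.
import matplotlib.pyplot as plt

periods = ["before the 15th century", "1400-1450", "1450-1500",
           "16th century", "cross-temporal/multiple"]
counts = [12, 35, 48, 20, 15]  # placeholder values

plt.figure(figsize=(8, 4))
plt.bar(periods, counts)
plt.xticks(rotation=30, ha="right")
plt.ylabel("Number of manuscripts")
plt.title("Book of hours production by period (illustrative data)")
plt.tight_layout()
plt.show()
```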

 

But one problem with the still graphs is that they can't effectively combine the date information with other information in the dataset, to explore the relationships between various aspects of the manuscript data and to display the "ecosystem" of book of hours production and circulation in medieval Europe. Some questions that might be answered by interactive graphs include: was book of hours production especially popular in certain countries or regions during certain periods of time? And did the decorations or stylistics of the genre change over time? To explore more interactive approaches, I am also experimenting with TimelineJS and creating a chronological gallery for the book of hours collection. TimelineJS is a storytelling tool that allows me to integrate time information, images of sample books of hours, and descriptive texts into the presentation. I am currently discussing this idea with my mentor, and I look forward to sharing more about it in the next few weeks' blogs.
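To give an idea of the shape of the data TimelineJS consumes, here is a minimal sketch that builds its JSON from a couple of invented entries; the shelfmarks, years, and notes are placeholders, not items from the actual collection.

```python
# Minimal sketch: assemble the JSON structure TimelineJS reads, one event per
# manuscript. All entries below are invented placeholders.
import json

manuscripts = [
    {"shelfmark": "MS Example 1", "year": 1425, "note": "Text with border decoration."},
    {"shelfmark": "MS Example 2", "year": 1480, "note": "Miniatures added by a later hand."},
]

timeline = {
    "title": {"text": {"headline": "Book of hours collection",
                       "text": "A chronological gallery (illustrative data)."}},
    "events": [
        {
            "start_date": {"year": m["year"]},
            "text": {"headline": m["shelfmark"], "text": m["note"]},
        }
        for m in manuscripts
    ],
}

with open("timeline.json", "w", encoding="utf-8") as fh:
    json.dump(timeline, fh, indent=2)
```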

 

Best,

Rongqian