LEADS Blog

California Digital Library

California Digital Library – YAMZ
Bridget Disney
We are making slow and steady progress on YAMZ (pronounced yams). My task this week has been to import data into my local instance. I began by trying to import the data manually into PostgreSQL but got stuck even though I tried a few different methods I had found using Google.
This is where the advice of someone experienced comes in handy. In our Zoom meeting last week with John (mentor), Dillon (previous intern), and Hanlin, it became evident that I should have been using the import function that is available in YAMZ. Finally, progress could be made. I hammered out some fixes that allowed the data to be imported, but it wasn't elegant. Another meeting with John shed light on the correct way to do it.
YAMZ uses four PostgreSQL tables: users, terms, comments, and tracking. We had errors during the import because the 'terms' data references a foreign key from the 'users' table, so the 'users' table must be imported first. There were still other errors, and we only ended up importing 43 records into the 'terms' table. There should have been about 2700! John will be providing us with another set of exported JSON files; the first one only had 252 records. He also showed us some nifty Unix tricks for finding and replacing data.
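To illustrate the ordering constraint, here is a minimal sketch of a manual load in Python, not the YAMZ import function we actually ended up using; the table and column names below are assumptions for illustration only.

```python
# Minimal sketch (not the actual YAMZ importer): load the exported JSON
# in dependency order so 'terms' rows never reference a missing user.
# Table and column names here are assumptions for illustration only.
import json
import psycopg2

conn = psycopg2.connect(dbname="yamz", user="postgres")
cur = conn.cursor()

def load(path, insert_sql, fields):
    with open(path) as f:
        for rec in json.load(f):
            cur.execute(insert_sql, tuple(rec.get(k) for k in fields))

# 'users' first, because 'terms.owner_id' is a foreign key into 'users'
load("users.json",
     "INSERT INTO users (id, name) VALUES (%s, %s)",
     ("id", "name"))
load("terms.json",
     "INSERT INTO terms (id, owner_id, term_string) VALUES (%s, %s, %s)",
     ("id", "owner_id", "term_string"))

conn.commit()
cur.close()
conn.close()
```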
 

 

On the server side, both Hanlin and I have been able to access the production site on AWS. We're going to try to figure out how to get that running this week.
 
LEADS Blog

Week 5: Sonia Pascua – Project progress report

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

I. Project update
  • The digitized 1910 LCSH was converted to Docx format by Peter
  • I was able to run the HIVE code on my local computer for code exploration
  • A sample db in HIVE is composed of 3 tables. Below is the LCSH db in HIVE
  • I was able to create the 1910 LCSH thesaurus for letter A on page 1 using MultiTes
  • I generated the HTML of the 1910 LCSH MultiTes thesaurus

  • I also generated the RDF/XML format of the thesaurus
  • I am looking at possible solutions for the project:
    • How will the Docx format of the 1910 LCSH be converted to RDF automatically?
    • How will the Docx format of the 1910 LCSH be loaded into the HIVE DB automatically?
II. Concerns / Issues / Risks
  • Which solution to take given the limited time
  • The SKOS implementation in HIVE supports only a limited subset of the standard SKOS elements
III. Pending action item
  • To explore MultiTes for automating the conversion of the 1910 LCSH Doc to RDF
  • To explore other tools for automating the conversion of the 1910 LCSH Doc to RDF (one possible path is sketched below)
  • To explore the HIVE code for automating the loading of the 1910 LCSH Doc into the HIVE db
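As a first pass at the automation question, here is a rough sketch of what a Docx-to-RDF path could look like using python-docx and rdflib. These are not the tools decided on for the project; the file name, the namespace, and the assumption that each heading occupies its own paragraph are placeholders for illustration.

```python
# Rough sketch of one possible automation path (assumptions: each 1910 LCSH
# heading sits on its own paragraph in the .docx; real entries will also need
# parsing of "See"/"See also" references and indentation).
from docx import Document                      # pip install python-docx
from rdflib import Graph, Namespace, Literal   # pip install rdflib
from rdflib.namespace import SKOS, RDF

doc = Document("lcsh_1910_vol1.docx")          # hypothetical file name
g = Graph()
LCSH1910 = Namespace("http://example.org/lcsh1910/")  # placeholder namespace
g.bind("skos", SKOS)

for para in doc.paragraphs:
    heading = para.text.strip()
    if not heading:
        continue
    concept = LCSH1910[heading.replace(" ", "_")]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(heading, lang="en")))

g.serialize(destination="lcsh_1910.rdf", format="xml")   # RDF/XML output
```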
LEADS Blog

Alyson Gamble: Week 03

This week on my project at the Historical Society of Pennsylvania, I focused on exploring the data and planning what I want to accomplish by the end of this fellowship. While exploring the data, as well as the project files from last year's fellow, I noticed a few very important issues.
  1. Address data needs to be cleaned, using available resources for dealing with old street names, and standardized into consistent address formats
  2. School data needs to be adjusted for duplication and renaming
  3. Occupation data needs to be considered. Can the non-standard occupations be mapped to controlled vocabularies (for example via fuzzy matching, as sketched below)? If not, how can this information be utilized?
These three main issues appear to be the best focus for my time during the next two weeks. To keep my mind active during this process, I’ll try to collect unusual examples, which I’ll share here.
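As a sketch of what the occupation mapping could look like, the snippet below fuzzy-matches free-text occupations against a small placeholder vocabulary using Python's standard difflib; the vocabulary, the example strings, and the similarity cutoff are all made up for illustration.

```python
# Sketch of one way to map free-text occupations to a controlled vocabulary
# via fuzzy string matching (difflib, standard library). The vocabulary and
# cutoff here are placeholders, not the project's actual lists.
import difflib

controlled_vocab = ["laborer", "machinist", "seamstress", "clerk", "teacher"]

def map_occupation(raw, cutoff=0.8):
    matches = difflib.get_close_matches(raw.lower().strip(),
                                        controlled_vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None    # None = needs manual review

print(map_occupation("Labourer"))     # -> 'laborer'
print(map_occupation("dress maker"))  # -> None (flag for review)
```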

Alyson Gamble
Doctoral Student, Simmons University
LEADS Blog

Jamillah Gabriel: Moments in Time

The Tule Lake exit phone book (FAR) data represents the majority of the information available about the many Japanese American citizens who passed through the internment camp system. In most cases, this, in conjunction with the limited data from the entry file, represents all of the information that is available. While there is not much here that allows us to paint a complete picture of their lives, we can at least conceive of some select moments in time, which I attempt to do in the case of Mrs. Kashi.


Above: FAR exit file

 


Above: Exit record for Mrs. Mitsuye Kashi

Mrs. Mitsuye Kashi was born on April 24, 1898 in the southern division of Honshu, Japan. Eighteen years later, she would arrive in the US, and later become an American citizen, marry Jutaro Kashi, and have a son, Tomio. Before internment, she and her family lived in Sacramento, California. But in 1942, the family was moved to a local assembly center located on Walerga Road, and soon after, assigned to the Sacramento internment camp. At some point, the family was transferred to the Tule Lake internment camp, where sadly, Mrs. Kashi would spend her last days. On June 4, 1943, less than a year after arriving at Tule Lake, Mrs. Kashi committed suicide. She was 45 years old. We have no record of why she committed suicide, but one can assume that life in the internment camp was unbearable for her.

In September, just three months later, Mrs. Kashi’s son was sent away to Central Utah Project, or the Topaz camp, which was a segregation center for dissidents. There are no records of what happened to Tomio afterwards. Her husband was released on June 28, 1944 and upon final departure from the camp, became a resident of Santa Fe, New Mexico.

LEADS Blog

Rongqian Ma; Week 3 – Visualizing the Date Information

During week 3 I focused on working with the date information in the manuscripts data. As with the geographical data, working with date information also means working with variants. The date information is presented as descriptive text (e.g., "early 15th century"), and the style of description varies across the collection. Most of the time dates appear as ranges (e.g., 1425-1450), and there is a lot of overlap between the ranges. Ambiguity exists across the dataset, mostly because the date information was collected and pieced together from the texts of the manuscripts themselves. Additionally, some manuscripts appear to have been produced and refined over different periods, for example with texts created during the 14th and 15th centuries and illustrations/decorations added later. So my first task was to regroup the date information and make it clearer for visualization. This graph shows how I color-coded the dataset and grouped the data into five general categories: before the 15th century, 1400-1450 (first half of the 15th century), 1450-1500 (second half of the 15th century), the 16th century, and cross-temporal/multiple periods.
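As an illustration of that regrouping step, here is a simplified sketch of how a descriptive date string or range might be binned into the five categories; the parsing rules are deliberately minimal and are my assumptions, not the exact logic used for the dataset.

```python
# Simplified sketch of the regrouping step: map a descriptive date string or
# range to one of the five categories used in the visualization. Real records
# need more parsing rules (e.g. "early 15th century", "ca. 1460").
import re

def categorize(date_text):
    years = [int(y) for y in re.findall(r"\b(1[0-9]{3})\b", date_text)]
    if not years:
        return "unparsed"
    lo, hi = min(years), max(years)
    if hi < 1400:
        return "before the 15th century"
    if lo >= 1400 and hi <= 1450:
        return "1400-1450"
    if lo >= 1450 and hi <= 1500:
        return "1450-1500"
    if lo >= 1500:
        return "16th century"
    return "cross-temporal/multiple periods"

print(categorize("1425-1450"))                # -> 1400-1450
print(categorize("1380, decorated ca. 1460")) # -> cross-temporal/multiple periods
```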

 

       

 

Based on these groupings, I created multiple line graphs, histograms, and bar charts to visualize the temporal distribution of book of hours production from different angles. The static visualizations helped me find some interesting insights; for example, production of the book of hours increased from the 1450s onward, roughly the same period as the invention of the printing press.

 

But one problem with the static graphs is that they can't effectively combine the date information with the other information in the dataset to explore relationships between various aspects of the manuscript data and to display the "ecosystem" of book of hours production and circulation in medieval Europe. Some questions that might be answered by interactive graphs include: Was the book of hours especially popular in certain countries or regions during particular periods? And did the decorations or stylistics of the genre change over time? To explore more interactive approaches, I am also experimenting with TimelineJS and creating a chronological gallery for the book of hours collection. TimelineJS is a storytelling tool that lets me integrate time information, images of sample books of hours, and descriptive texts into the presentation. I am currently discussing this idea with my mentor and look forward to sharing more about it in the next few weeks' blogs.
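To give a sense of how the collection data could feed TimelineJS, here is a small sketch that writes placeholder manuscript records into the JSON structure TimelineJS reads (an object with an "events" list); the records and image URLs are invented.

```python
# Sketch of preparing a TimelineJS feed from the manuscript records
# (placeholder data; TimelineJS expects a JSON object with an "events" list,
# each event carrying a start_date, text, and optional media entry).
import json

manuscripts = [
    {"id": "MS 1", "year": 1425, "title": "Book of Hours, Use of Rome",
     "image": "https://example.org/ms1.jpg"},     # placeholder record
    {"id": "MS 2", "year": 1480, "title": "Book of Hours, Use of Paris",
     "image": "https://example.org/ms2.jpg"},
]

timeline = {"events": []}
for ms in manuscripts:
    timeline["events"].append({
        "start_date": {"year": ms["year"]},
        "text": {"headline": ms["title"], "text": ms["id"]},
        "media": {"url": ms["image"], "caption": ms["title"]},
    })

with open("book_of_hours_timeline.json", "w") as f:
    json.dump(timeline, f, indent=2)
```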

 

Best,

Rongqian

 

LEADS Blog

LEADS Blog #3: Set up a `virtualenv` for yamz!

 

Set up a `virtualenv` for yamz!

Hanlin Zhang

July 9th, 2019

 

This week I solved a Google OAuth login problem caused by incompatible Python environments. Typically, multiple versions of Python can be installed on the same machine; for example, I have Python 2.7.10 (which comes with macOS), Python 2.7.16 (Anaconda), and Python 3.7.1 (Anaconda) installed on my laptop, which can create compatibility issues. In our case, we know yamz requires Python 2, but the real problem is that there are different builds of Python 2, and unexpected errors may occur if the program is installed against the "wrong" Python setup. The good news is that Bridget is able to run yamz successfully with the following configuration:

 

Python 2.7.10 on macOS Mojave 10.14.5

 

However, I was unable to reproduce the same result at first because the program kept throwing an error message. I did the initial debugging with help from Bridget, but I was still unable to solve the problem until John Kunze, our LEADS mentor, shed light on isolating the Python environment with `virtualenv`. John suspected the error was caused by running yamz on an Anaconda distribution of Python:

 

Python 2.7.16 (Anaconda) on macOS Mojave 10.14.5

 

which conflicts with the system's default Python. However, this can be solved using a Python package called `virtualenv`. According to its documentation (see https://virtualenv.pypa.io/en/latest/), `virtualenv` is able to "create isolated Python environments": it takes a specified version of Python from my laptop and builds a self-contained environment to run the program in, which is a bit like running a virtual machine for Python.

Luckily, `virtualenv` solved the problem and now I'm able to log in! Furthermore, I can now isolate the Python environment, which allows me to investigate further how the Python version affects installing yamz. I'm going to try installing yamz on several different Python versions. Since Anaconda distributions are so common right now, I think it is worth testing Anaconda Python and putting the result in the new readme file. I'm curious whether the login problem was caused by Anaconda Python itself or by the conflict between the default version of Python on my laptop and the Anaconda distribution I installed later.
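A quick way to confirm which interpreter is actually active, for example before and after activating the virtualenv, is a three-line check:

```python
# Quick sanity check, run both inside and outside the activated virtualenv,
# to confirm which interpreter yamz would actually be using.
import sys

print(sys.version)       # e.g. 2.7.10 (system) vs. 2.7.16 (Anaconda)
print(sys.executable)    # path of the interpreter in use
print(sys.prefix)        # inside a virtualenv this points at the env directory
```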

 

To learn more about `virtualenv`:

  • Virtualenv and why you should use virtual environments

https://www.youtube.com/watch?v=N5vscPTWKOk&t=139s

  • Working Effectively with Python Virtual Environments (Virtualenv)

https://www.youtube.com/watch?v=8KWVEc6vFgA&t=53s

 

LEADS Blog

Week 3: Metadata – data about data

 

LEADS site: Repository Analytics & Metrics Portal

 

In the third week, I worked on downloading metadata from the institutional repositories. We had already prepared a script to download the metadata based on the RAMP dataset we want to analyze. However, because each period brings requests for different documents, we must download the metadata for every unique URL requested in order to have a complete set.
 
Besides gathering the metadata, I also did some analysis of the metadata and the institutional repositories. From my observations, I found that some metadata terms are commonly used across institutional repositories, while other terms are unique to only a few IRs.
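As a toy illustration of that comparison (the repository names and terms below are hypothetical), simple set operations are enough to separate the shared terms from the repository-specific ones:

```python
# Illustrative sketch (repository names and field lists are hypothetical):
# compare which metadata terms each institutional repository uses.
ir_terms = {
    "IR_A": {"dc.title", "dc.creator", "dc.date", "dc.subject"},
    "IR_B": {"dc.title", "dc.creator", "dc.date", "dc.rights"},
    "IR_C": {"dc.title", "dc.creator", "thesis.degree.name"},
}

common = set.intersection(*ir_terms.values())
print("Terms common to all IRs:", sorted(common))

for ir, terms in ir_terms.items():
    others = set.union(*(t for name, t in ir_terms.items() if name != ir))
    print(ir, "unique terms:", sorted(terms - others))
```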
 
At the weekly meeting, we gathered some ideas that we want to focus on, and in the upcoming weeks we will tackle these research questions under the supervision of Prof. Arlitsch and Jonathan.
 
Nikolaus Parulian
LEADS Blog

Week 3-4: Sonia Pascua, The Paper and the proposal

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

In the past weeks, I made progress by co-authoring a paper with Jane Greenberg, Peter Logan, and Joan Boone. We submitted the paper, entitled "SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica," to NKOS 2019, which will be held at the Dublin Core Conference 2019 in South Korea on Sept 23-26, 2019. We can't wait to hear about the paper's acceptance and hope this research offers something novel to the field of Simple Knowledge Organization Systems (SKOS).

This paper was also the starting point for discussing what the approaches to the SKOS-ination of the 1910 LCSH could really be.
This week I met with my mentor Peter for our weekly cadence. The scope was clarified and nailed down in this meeting. The project aims to transform the digitized 1910 LCSH into SKOS. Peter had shared the text file of the digitized 1910 LCSH, and we discussed possible approaches for executing my task. I appreciated my mentor's expertise in handling both the project and a mentee like me. He made an effort to synchronize our understanding of the concepts. We dwelled on the distinction between "keyword" and "index term," which I believe is very critical in building a thesaurus in SKOS. Having presented my plan of execution to him, below are the steps we laid out to achieve the goal of the project:
  • The digitized 1910 LCSH is converted to text format to help with the manipulation of texts and words. This has already been done by Peter. The 1910 LCSH in digitized format, made available by Google under the HathiTrust project, is composed of 2 volumes. In the text format (.docx), volume 1 has 363 pages and volume 2 has 379 pages.
  • The vocabularies are assessed to identify their structures and relationships in the 1910 LCSH so they can be mapped to the elements and syntax of the SKOS vocabulary (see the sketch after this list for a hypothetical mapping of a single entry). These elements and syntax have integrity conditions that serve as guidelines for best practice in constructing SKOS vocabularies.
  • Processes, methods, and methodology are documented and tested for reproducibility and replication purposes. The project will run for 10 weeks, and it will be challenging to complete the SKOS-ination of the entire 2 volumes of the 1910 LCSH. However, if the processes, tools, techniques, and guides are available, the project can be continued and the knowledge transferred so the SKOS of the 1910 LCSH can be completed.
  • Tools used in building the SKOS of the 1910 LCSH, and in automating its creation processes, are seen as one of the vital outputs of this endeavor.
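For a concrete sense of the mapping step, the sketch below expresses a single, hypothetical 1910 LCSH-style entry ("Abbeys. See also Cathedrals; Monasteries") in SKOS using rdflib. Treating "See also" as skos:related is an assumption that the vocabulary assessment above would need to confirm.

```python
# Hypothetical mapping of a single 1910 LCSH-style entry to SKOS, e.g.
#   "Abbeys  See also  Cathedrals; Monasteries"
# Mapping "See also" to skos:related is an assumption, not a settled decision.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, RDF

LCSH1910 = Namespace("http://example.org/lcsh1910/")   # placeholder namespace
g = Graph()

abbeys = LCSH1910["Abbeys"]
g.add((abbeys, RDF.type, SKOS.Concept))
g.add((abbeys, SKOS.prefLabel, Literal("Abbeys", lang="en")))
for related in ("Cathedrals", "Monasteries"):           # "See also" references
    g.add((abbeys, SKOS.related, LCSH1910[related]))

print(g.serialize(format="xml"))                        # RDF/XML serialization
```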
For the moment, I have started reading the W3C Semantic Web and ALA guides to understand the methodologies and methods for constructing SKOS. In the search for tools, MultiTes, for which MRC has acquired a license, will be explored first.
My personal desire is not only to SKOSify the 1910 LCSH but also to document the process of finding the appropriate approach, techniques, and tools, which could be used by and shared with not only the Digital Scholarship Center but also other entities with the same project goals and objectives. SKOS is a representation that is readily consumed on the web and allows vocabulary creators to publish born-digital vocabularies on the web [Frazier, 2015].
References:
  1. Frazier, P. (2015, August 11). SKOS: A Guide for Information Professionals. Retrieved July 9, 2019, from http://www.ala.org/alcts/resources/z687/skos. Association for Library Collections and Technical Services, American Library Association
  2. HathiTrust: Home. (n.d.). Retrieved July 9, 2019, from www.hathitrust.org/. HathiTrust Digital Library
  3. Logan, P. (n.d.). Nineteenth-Century Knowledge Project. Retrieved July 9, 2019, from tu-plogan.github.io/. Digital Scholarship Center, Temple University
  4. SKOS Simple Knowledge Organization System – Home Page. (n.d.). Retrieved July 9, 2019, from https://www.w3.org/2004/02/skos/. Semantic Web Deployment Working Group, World Wide Web Consortium (W3C)
LEADS Blog

Jamillah Gabriel: Deep Diving into the Data

This past week has been spent delving into the datasets available to me in order to get a better sense of the lives of internees of the Japanese American internment camps, from entry to exit. This means I'm looking at the entry data, exit data, and incident cards to glean a better understanding of life during this time. Some of the data that help me in this endeavor are details about the first camp where a person entered the system, the assembly center they were taken to before reaching the camp, the date they first arrived at camp, other camps they may have been transferred to or from, the camp they last stayed at before exiting, their final departure date, their destination after departure from the camp, their birthdate and birthplace, and where they lived before internment (among many other details). The incident cards represent the recordkeeping system that includes details of various "offenses" that took place within the camps; they were typically written up only for people who violated camp rules or, in some cases, to keep records of deaths within the camp. Not every internee has incident cards, so there are silences and erasures within these archival records that might never be uncovered. But what one can do is gather up all of these details and try to glean from them a narrative about the life of an internee imprisoned in these camps.

This is what I'm currently working on, and I hope to share a little bit about select people in the coming weeks. One of the most important things to consider is the sensitivity of these records, as not all data can be publicly divulged at this point. NARA, the current steward of the records, has asked that we adhere to a 75-year restriction when disclosing data. In other words, any records of events taking place after July 8, 1944 cannot be revealed. This is something I'll have to keep in mind going forward in terms of how best to present the data in ways that both highlight and privilege the narratives and stories of the people unjustly imprisoned in these camps.
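As a sketch of how that restriction might be applied programmatically (the field name and the example records are hypothetical), any event dated after the cutoff is simply withheld:

```python
# Sketch of applying the 75-year restriction before presenting anything:
# withhold records dated after July 8, 1944. Field name is hypothetical.
from datetime import date

CUTOFF = date(1944, 7, 8)

def disclosable(record):
    """Return True only if the record's event date falls on or before the cutoff."""
    return record["event_date"] <= CUTOFF

records = [
    {"name": "example A", "event_date": date(1943, 3, 1)},
    {"name": "example B", "event_date": date(1944, 6, 30)},
    {"name": "example C", "event_date": date(1945, 1, 15)},
]
print([r["name"] for r in records if disclosable(r)])   # example A, example B
```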

 

LEADS Blog

Week 3: Bridging NHM collection to Biodiversity Occurrence dataset – example of land snails

To recap what I was trying to do: I wanted to find a species in Taiwan that also happened to be mentioned in the Proceedings of the Academy of Natural Sciences.
We chose Taiwan as our geographic point of interest because it has been historically complex in terms of sovereignty, and it will probably be an interesting example of shifting geopolitical realities.
This whole week I have been fleshing out the use case for the example we gathered from the Biodiversity Heritage Library: a land snail species, "Pupinella swinhoei sec. H. Adams 1866".
 
The idea is to bring the natural history museum (NHM) literature closer to real-life biodiversity occurrence datasets. I then gathered a dataset from GBIF by searching on the scientific name "Pupinella swinhoei". The aggregated GBIF dataset contains 50 occurrence records across 18 institutions (18 datasets), ranging from the year 1700 to now.
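For anyone wanting to reproduce the pull, a minimal sketch against GBIF's public occurrence search API looks roughly like this (only the first page of results is fetched here, and field availability varies by record):

```python
# Sketch of reproducing the GBIF pull with the public occurrence API
# (https://api.gbif.org/v1/occurrence/search).
import requests
from collections import Counter

resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"scientificName": "Pupinella swinhoei", "limit": 100},
)
results = resp.json()["results"]

print(len(results), "occurrence records on this page")
print(Counter(r.get("countryCode") for r in results))
print(Counter(r.get("datasetKey") for r in results))   # roughly one key per dataset
```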
 

(Different colors indicate different data sources.)
Though the 'countryCode' field mostly indicates that the records are from TW (Taiwan), that may not have been the sovereign entity at the time. To merge these datasets with the sovereignty of the time, I examined two of the 18 data sources first: the MCZ dataset versus the NSSM dataset.
In 1700, Taiwan was a county within Qing Dynasty China.
And in the 1930s, Taiwan was a colonized region of Japan.
I have some preliminary results from merging these two datasets' sovereignty fields using a logic-based taxonomy alignment approach. However, since I am preparing a conference submission based on this use case, I don't want to jinx anything! (Fingers crossed.)
If I am allowed to share more about the paper, I promise to discuss more in the next blog post!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica