LEADS Blog #3: Set up a `virtualenv` for yamz!



Hanlin Zhang

July 9th, 2019


This week I solved a Google OAuth login problem caused by incompatible Python environments. There are often multiple versions of Python installed on the same machine; for example, my laptop has Python 2.7.10 (which ships with macOS), Python 2.7.16 (Anaconda), and Python 3.7.1 (Anaconda), which can create compatibility issues. In our case, we know yamz requires Python 2, but the real problem is that there are different versions of Python 2, and unexpected errors may occur if the program is installed under the "wrong" one. The good news is that Bridget is able to run yamz successfully with the following configuration:


Python 2.7.10 on macOS Mojave 10.14.5


However, I was unable to reproduce the same result at first because the program kept throwing an error message. I did some initial debugging with help from Bridget, but I was still unable to solve the problem until John Kunze, our LEADS mentor, suggested isolating the Python environment with `virtualenv`. John suspected the error was caused by running yamz on an Anaconda distribution of Python:


Python 2.7.16 (Anaconda) on macOS Mojave 10.14.5


which kept conflicting with the system's default installation. This can be solved with a Python package called `virtualenv`. According to its documentation (see https://virtualenv.pypa.io/en/latest/), `virtualenv` is able to "create isolated Python environments": it takes a specified version of Python installed on the machine and builds a self-contained environment in which to run the program, much like running a virtual machine for Python on my laptop.
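A quick way to confirm the isolation `virtualenv` provides is to check, from inside Python, whether the running interpreter belongs to a virtual environment. This is a minimal sketch using only the standard library; it covers both classic `virtualenv` (which sets `sys.real_prefix`) and newer `venv`-style environments (which set `sys.base_prefix`):

```python
import sys

def in_virtualenv():
    """Return True if the running interpreter was created by virtualenv/venv.

    virtualenv (and the stdlib venv module) point sys.prefix at the
    environment's own directory; the original interpreter's prefix is
    kept in sys.real_prefix (classic virtualenv) or sys.base_prefix (venv).
    """
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix

print("interpreter:", sys.executable)
print("inside a virtualenv:", in_virtualenv())
```

Running this inside and outside an activated environment makes it obvious which Python is actually executing the program.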

Luckily, `virtualenv` solved the problem and now I'm able to log in! Furthermore, since I can now isolate the Python environment, I can investigate the impact of Python versions on installing yamz. I'm going to explore installing yamz under several different Python versions. Since Anaconda distributions are so common right now, I think it might be worth testing Anaconda Python and putting the results in the new readme file. I'm curious whether the login problem was caused by Anaconda Python itself or by the conflict between the default version of Python on my laptop and the Anaconda distribution I installed later.


To learn more about `virtualenv`:

  • Virtualenv and why you should use virtual environments


  • Working Effectively with Python Virtual Environments (Virtualenv)




Week 3: Metadata – data about data


LEADS site: Repository Analytics & Metrics Portal


In the third week, I worked on downloading the metadata from the institutional repositories. We had already prepared a script to download the metadata based on the RAMP dataset we want to analyze. However, because each period brings new requests for different documents, we must download the metadata for every unique requested URL to have a complete set.
Besides gathering the metadata, I also did some analysis of the metadata and the Institutional Repositories. From my observation, I found that the Institutional Repositories share a common core of metadata terms, while some terms are unique to only a few IRs.
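The common-core-versus-unique-terms observation can be sketched with simple set operations. The repositories and term lists below are made up for illustration; they are not the actual RAMP data:

```python
# Hypothetical metadata terms observed in three institutional repositories.
ir_terms = {
    "IR_A": {"dc.title", "dc.creator", "dc.date", "dc.subject", "thesis.degree"},
    "IR_B": {"dc.title", "dc.creator", "dc.date", "dc.rights"},
    "IR_C": {"dc.title", "dc.creator", "dc.date", "dc.subject", "local.embargo"},
}

# Terms every repository uses (the common core).
common = set.intersection(*ir_terms.values())

# Terms used by exactly one repository, mapped to that repository.
all_terms = set.union(*ir_terms.values())
unique = {
    term: [ir for ir, terms in ir_terms.items() if term in terms]
    for term in all_terms
    if sum(term in terms for terms in ir_terms.values()) == 1
}

print("common terms:", sorted(common))
print("unique terms:", unique)
```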
At the weekly meeting, we gathered some ideas that we want to focus on, and in the upcoming weeks we will tackle these research questions under the supervision of Prof. Arlitsch and Jonathan.
Nikolaus Parulian

Week 3-4: Sonia Pascua, The Paper and the proposal

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

In the past weeks, I made progress by co-authoring a paper with Jane Greenberg, Peter Logan, and Joan Boone. We submitted the paper, entitled "SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica," to NKOS 2019, which will be held at the Dublin Core Conference 2019 in South Korea on Sept 23-26, 2019. We can't wait to hear about the paper's acceptance, hoping that this research offers some novelty in the field of Simple Knowledge Organization System (SKOS).

This paper was also the starter for discussing what the approaches to the SKOS-ination of the 1910 LCSH could be.
This week I met with my mentor Peter for our weekly cadence, where the scope was clarified and nailed down. The project aims to transform the digitized 1910 LCSH into SKOS. Peter shared the text file of the digitized 1910 LCSH, and we discussed possible approaches for executing my task. I appreciated my mentor's expertise in handling both the project and a mentee like me; he made an effort to synchronize our concepts. We dwelt on the distinction between "keyword" and "index term," which I believe is critical in building a thesaurus in SKOS. I presented my plan of execution, and below are the steps we laid out to achieve the goal of the project:
  • The digitized 1910 LCSH is converted to text format to help in manipulating the text. Peter has already done this. The 1910 LCSH in digitized format, made available by Google under the HathiTrust project, comprises two volumes; in text format (.docx), volume 1 is 363 pages and volume 2 is 379 pages.
  • The vocabularies are assessed to identify the structures and relationships in the 1910 LCSH so they can be mapped to the elements and syntax of the SKOS vocabulary. These elements and syntax have integrity conditions that serve as guidelines and best practices for constructing SKOS vocabularies.
  • Processes, methods, and methodology are documented and tested for reproducibility and replication purposes. The project will run for 10 weeks, and completing the SKOS-ination of both volumes of the 1910 LCSH in that time is challenging. However, if the processes, tools, techniques, and guides are available, the project can be continued and the knowledge transferred so the SKOS of the 1910 LCSH can be finished.
  • The tools used in building the SKOS of the 1910 LCSH, and in automating its creation processes, are seen as one of the vital outputs of this endeavor.
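As a rough sketch of the kind of output the mapping step aims at, the snippet below renders one heading as SKOS Turtle. The cross-reference markers ("sa" for "see also", "x" for "see from") follow the conventions of printed LCSH volumes, and the property mapping and URIs are illustrative assumptions, not a definitive crosswalk:

```python
# Map LCSH print-edition cross-reference markers to SKOS properties.
# The property choices below are one common interpretation, not a
# settled standard for converting the 1910 volumes.
REF_TO_SKOS = {
    "sa": "skos:related",    # "see also" -> associative relation
    "x":  "skos:altLabel",   # "see from" -> non-preferred label
}

def heading_to_turtle(heading, refs):
    """Render one heading and its cross-references as SKOS Turtle.

    `refs` is a list of (marker, target) pairs; URIs are hypothetical.
    """
    uri = "<http://example.org/lcsh1910/%s>" % heading.replace(" ", "_")
    lines = [uri + " a skos:Concept ;",
             '    skos:prefLabel "%s"@en ;' % heading]
    for marker, target in refs:
        prop = REF_TO_SKOS[marker]
        if prop == "skos:altLabel":
            lines.append('    %s "%s"@en ;' % (prop, target))
        else:
            t = "<http://example.org/lcsh1910/%s>" % target.replace(" ", "_")
            lines.append("    %s %s ;" % (prop, t))
    lines[-1] = lines[-1].rstrip(" ;") + " ."   # close the final statement
    return "\n".join(lines)

print(heading_to_turtle("Abbeys", [("sa", "Monasteries"), ("x", "Abbayes")]))
```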
For the moment, I have started reading the W3C Semantic Web and ALA guides to understand the methodologies and methods of constructing SKOS. In the search for tools, I will start by exploring MultiTes, for which MRC has acquired a license.
My personal desire is not only to SKOSify the 1910 LCSH but also to document the process of finding the appropriate approaches, techniques, and tools, so they can be used by and shared not only with the Digital Scholarship Center but also with other groups pursuing the same goal. SKOS is a representation that is readily consumed on the web and allows vocabulary creators to publish born-digital vocabularies on the web (Frazier, 2015).
  1. Frazier, P. (2015, August 11). SKOS: A Guide for Information Professionals. Retrieved July 9, 2019, from http://www.ala.org/alcts/resources/z687/skos. Association for Library Collections and Technical Services, American Library Association
  2. HathiTrust: Home. (n.d.). Retrieved July 9, 2019, from www.hathitrust.org/. HathiTrust Digital Library
  3. Logan, P. (n.d.). Nineteenth-Century Knowledge Project. Retrieved July 9, 2019, from tu-plogan.github.io/. Digital Scholarship Center, Temple University
  4. SKOS Simple Knowledge Organization System – Home Page. (n.d.). Retrieved July 9, 2019, from https://www.w3.org/2004/02/skos/. Semantic Web Deployment Working Group, World Wide Web Consortium (W3C)

Jamillah Gabriel: Deep Diving into the Data

This past week has been spent delving into the datasets available to me in order to get a better sense of the lives of internees of the Japanese American internment camps, from entry to exit. What this means is that I'm looking at the entry data, exit data, and incident cards to glean a better understanding of life during this time. Some of the data that help me in this endeavor are details about the first camp where a person entered the system, the assembly center they were taken to before getting to the camp, the date they first arrived at camp, other camps they may have been transferred to or from, the camp they last stayed at before exiting, their final departure date, the destination after their departure from the camp, birthdate, birthplace, and where they lived before internment (among many other details). The incident cards represent the recordkeeping system that includes details of various "offenses" that took place within the camps; they were typically only written up for people who violated rules in the camp or, in some cases, to keep records of deaths within the camp. Not every internee has incident cards, so there are silences and erasures within these archival records that might never be uncovered. But what one can do is gather up all of these details and try to glean from them a narrative about the life of the internee imprisoned in these camps.

This is what I’m currently working on and I hope to share a little bit about select people in coming weeks. One of the most important things to consider is the sensitivity of these records as not all data can be publicly divulged at this point. NARA, the current steward of the records, has asked that we adhere to the restriction of 75 years when disclosing data. In other words, any records taking place after July 8, 1944 cannot be revealed. This is something I’ll have to keep in mind going forward in terms of how to best present the data in ways that both highlight and privilege the narratives and stories of the people unjustly imprisoned in these camps.



Week 3: Bridging NHM collection to Biodiversity Occurrence dataset – example of land snails

To recap what I was trying to do: I wanted to find a species in Taiwan that also happened to be mentioned in the Proceedings of the Academy of Natural Sciences.
We chose Taiwan as our geographic point of interest because it has been historically complex in terms of sovereignty and will probably make an interesting example of shifting geopolitical realities.
This whole week I have been building up the use case on the example we gathered from the Biodiversity Heritage Library: a land snail species, "Pupinella swinhoei sec. H. Adams 1866".
The idea is to bring Natural History Museum (NHM) literature closer to real-life biodiversity occurrence datasets. I then gathered a dataset from GBIF by searching on the scientific name "Pupinella swinhoei". The aggregated GBIF dataset contains 50 occurrence records across 18 institutions (18 datasets), ranging from the year 1700 to now.

(different colors indicate different data sources)
Though the 'countryCode' field mostly indicates that the records are from TW (Taiwan), that may not reflect the sovereignty of the time period. To merge these datasets with the sovereignty of the time, I first examined two of the 18 data sources: the MCZ dataset and the NSSM dataset.
In 1700, Taiwan was a county within Qing Dynasty China, and in the 1930s Taiwan was a colonized region of Japan.
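Merging the occurrence records with the sovereignty of their period might start with something like the sketch below. Only the two periods above come from the use case; the boundary years, labels, and sample records are rough assumptions for illustration:

```python
# Attach a sovereignty label to each GBIF-style occurrence record based
# on its collection year. The cutoff years (1895, 1945) are approximate
# and chosen only for this illustration.
def taiwan_sovereignty(year):
    if year < 1895:
        return "Qing Dynasty China (Taiwan as a county)"
    elif year < 1945:
        return "Empire of Japan (Taiwan as a colony)"
    return "post-1945 (outside the two periods discussed here)"

# Two made-up records in the shape of GBIF occurrence data.
records = [
    {"scientificName": "Pupinella swinhoei", "countryCode": "TW", "year": 1700},
    {"scientificName": "Pupinella swinhoei", "countryCode": "TW", "year": 1932},
]
for rec in records:
    rec["sovereignty"] = taiwan_sovereignty(rec["year"])
    print(rec["year"], "->", rec["sovereignty"])
```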
I have some preliminary results on merging the sovereignty fields of these two datasets using a logic-based taxonomy alignment approach. However, since I am preparing a conference submission based on this use case, I don't want to jinx anything! (Fingers crossed.)
If I am allowed to share more about the paper, I promise to discuss more in the next blog post!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica



Minh Pham, Week 2: Mapping data out with aesthetics and readability


In week 2, I focused on refining the visualizations I did in week 1 to better visualize and understand one of the three large datasets we have in the project (so far). Thanks to the visualizations, I have some sense of the information-seeking behaviors of users who search and download from institutional repositories (IRs), including the devices used, device differences by geolocation, time of search, and the factors affecting their clicks and clickthroughs.


To improve the aesthetics of the visualizations, I paid attention to color contrast, graphic resolution, color ramps, color transparency, shapes, and the scales of the x and y axes. To enhance readability, I tried not to present too much information in one visual, following Miller's law of "The Magical Number Seven, Plus or Minus Two," so that people will not feel overwhelmed when looking at the visual and processing the information.


Besides visualizing the information that struck me as interesting in the first dataset, I also tried to wrangle the other datasets. Nikolaus managed to harvest the metadata relevant to each URL, which means we can look into the metadata content related to each search. However, it also creates a challenge for me: how to turn unstructured string data into structured data. This is not something I often do, but I am excited to brush up my skills in working with text data in the coming weeks.


Minh Pham


LEADS Blog #2: Deploying yamz on my machine!



Hanlin Zhang

July 3rd, 2019


Last week was a tough week for me. I worked closely with Bridget and John to set up a local yamz environment on my machine. Both John and Bridget are super helpful and very experienced in software development and problem-solving. One question I have asked John since I started reading the readme document of yamz.net (https://github.com/vphill/yamz) is: what do the 'xxx' marks stand for? I noticed a lot of them; for instance, there are a couple of blocks that start with 'xxx', such as:


xxx do this in a separate “local_deploy” dir?

xxx user = reader?


I was really curious about what those lines mean. Based on my experience with yamz, most of the lines starting with 'xxx' are pretty useful and definitely worth reading first. John said that in the world of software development, an 'xxx' mark stands for a problem waiting to be solved, or a comment so critical that it should be paid attention to immediately. It seems my intuition was right, but the convention is confusing to people without development experience. We are going to rewrite the readme file this summer to make it more reader-friendly. Meanwhile, I'm still debugging some errors I've encountered while developing.
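Finding all such action items in a readme is easy to script. This is a small illustrative sketch in the spirit of the convention John described; the sample text is made up, and you would point `scan` at the real file's contents:

```python
def scan(text, marker="xxx"):
    """Return (line_number, line) pairs for lines starting with the marker."""
    hits = []
    for num, line in enumerate(text.splitlines(), start=1):
        # Match the marker case-insensitively, ignoring leading whitespace.
        if line.lstrip().lower().startswith(marker):
            hits.append((num, line.strip()))
    return hits

readme = """Install the dependencies first.
xxx do this in a separate "local_deploy" dir?
Run the server.
XXX user = reader?
"""
for num, line in scan(readme):
    print("line %d: %s" % (num, line))
```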






According to Margaret Rouse (see the link below), OAuth "allows an end user's account information to be used by third-party services, such as Facebook, without exposing the user's password". The central idea of OAuth is to reduce the number of times a password is required to establish an identity, and instead to ask trusted parties to issue certificates, for both security and convenience. But it also raises the question: to what extent do we trust Google, Facebook, Twitter, etc. as gatekeepers of our personal identity? What price are we paying to use their services in lieu of money? Will it stop at 'we run ads'?
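The first step of that flow, sending the user to the provider's consent page instead of asking for a password, can be sketched with just the standard library. The client ID, redirect URI, and state value below are placeholders; the endpoint is Google's published OAuth 2.0 authorization endpoint:

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"

def authorization_url(client_id, redirect_uri, scope, state):
    """Build the URL the app redirects the user to for Google login."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",  # ask for an authorization code, not a token
        "scope": scope,
        "state": state,           # anti-CSRF value the app must verify later
    }
    return AUTH_ENDPOINT + "?" + urlencode(params)

url = authorization_url("my-app-id.example", "http://localhost:5000/callback",
                        "openid email", "random-state-123")
print(url)
```

After the user consents, the provider redirects back with a one-time code that the app exchanges (server-side, with its client secret) for tokens; the password never touches the third-party app.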


To read more:

  • What does XXX mean in a comment?


  • OAuth




Rongqian Ma: Week 2 – Visualizing complexities of places/locations in the manuscript data

During week 2 I started working with data on the geolocations where the manuscripts were produced and used. Something I didn't quite realize before delving into the data is that these are not simply place names, but geo-information represented in different formats and with different connotations. The variety in the geolocation data shows up in several ways: a) missing data (i.e., N/A); b) different units, presented as regions, nation-states, and cities, respectively; c) suspicious information (i.e., "?") indicated in the original manuscripts; d) changes in geography over historical periods, which makes the inconsistency of geographies over time hard to visualize; and e) single vs. multiple locations represented in one data entry.

Facing this situation, I spent some time cleaning and reformatting the data as well as thinking about strategies for visualizing this part of it. I merged all the city information with the country/nation-state information and searched for old geographies such as Flanders (and discovered its complexities...). I created a pie chart showing the proliferation and popularity of the book of hours in certain areas, and multiple bar charts showing the merged categories (e.g., city information, different sections of the Flanders area). I also found a map of Europe during the Middle Ages (the time period represented in the dataset) and added other information (e.g., percentages) to the map, which may be a more straightforward way to communicate the geographical distribution of book of hours production. As the geographical data are necessarily related to the temporal data and to other categories concerning the content and decoration of the manuscripts, my next step is to create more interactive visualizations that connect different categories of the dataset.
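A possible first pass at taming such entries is a small normalizer that flags missing and uncertain values and splits multi-place entries. The sample values and the semicolon-separated convention here are illustrative assumptions, not the project's actual cleaning rules:

```python
def normalize_place(raw):
    """Classify one manuscript place entry and extract its place names."""
    if raw is None or raw.strip().upper() in ("", "N/A"):
        return {"status": "missing", "places": []}
    uncertain = "?" in raw  # "?" marks suspicious attributions
    # An entry may name several places, e.g. "Bruges; Ghent".
    places = [p.strip().rstrip("?").strip() for p in raw.split(";")]
    return {"status": "uncertain" if uncertain else "ok", "places": places}

for raw in ["Flanders?", "Bruges; Ghent", "N/A", "Paris"]:
    print(repr(raw), "->", normalize_place(raw))
```

Keeping the uncertainty flag alongside the cleaned name preserves the "?" from the original records instead of silently standardizing it away.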
I'm excited to work with such complexity in the manuscript data, which also reminds me of a relatively similar case I encountered before with Chinese manuscripts, where date information was represented in various formats, especially combinations of the old Chinese style and the western calendar style. Standardization might not always be the best way to communicate the ideas behind the data, and visualizing the complexity is a challenge.

Week 2: Kai Li: It’s all about MARC

It's been two very busy weeks since my last update. It has almost become common sense that getting your hands dirty with the data is the most important thing in any data science project. That is exactly what I have been doing in my project.

The scope of my project is one million records of books published in the US and UK since the mid-20th century. The dataset turned out to be even larger than I originally imagined: in XML, the final data is a little under 6 gigabytes, which is nearly the largest dataset I have ever used. As someone who has (very unfortunately) developed quite solid skills for parsing XML in R, the size of the file became the first major problem I had to solve in this project: I could not load the whole XML file into R because it would exceed the string-size limit that R allows (2 GB). But thanks to this limitation of R, I had the chance to re-learn XML parsing in Python. By re-using some code written by Vic last year, the new parser was developed without too much friction.
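For files too large to load whole, Python's `xml.etree.ElementTree.iterparse` can process records as they stream by and free each one afterward, so memory stays bounded. This is a minimal sketch, with a tiny inline sample standing in for the real 6 GB MARCXML file:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# A toy stand-in for a MARCXML collection file.
SAMPLE = b"""<collection>
  <record><leader>00000nam</leader></record>
  <record><leader>00000nam</leader></record>
</collection>"""

def count_records(source):
    count = 0
    # iterparse yields each element when its closing tag is seen, so we
    # can handle one <record> at a time and then discard it.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            count += 1
            elem.clear()   # free the subtree we no longer need
    return count

print(count_records(BytesIO(SAMPLE)))
```

Replacing `BytesIO(SAMPLE)` with an open file handle gives the same bounded-memory behavior on an arbitrarily large file.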

According to Karen Coyle (who, BTW, is one of my heroes in the world of library cataloging), the development of MARCXML represents how the library cataloging community missed the chance to fit its legacy data into the newer technological landscape (Coyle 2015, p. 53). She definitely has a point: while MARCXML does an almost perfect job of translating the countless MARC fields, subfields, and indicators into the structure of XML, it doesn't do anything beyond that. It keeps all the inconveniences of the MARC format, especially the disconnection between text and semantics, which is the reason we had the publisher entity problem in the first place.


[A part of one MARC record]

Some practical problems also emerge from this characteristic of MARCXML. The first is that data hosted in XML keeps all the punctuation of the MARC records. The use of punctuation is required by the International Standard Bibliographic Description (ISBD), which was developed in the early 1970s (Gorman, 2014) and has been one of the most important cataloging principles in the MARC 21 system. Punctuation in bibliographic data mainly serves the needs of printed-catalog users: it is said to give users more context about the information printed in the physical catalog (which, if you noticed, no one is using today). Not surprisingly, this is a major source of problems for the machine-readability of library bibliographic data: different punctuation is supposed to be used when the same piece of data appears before different subfields within the same field, a context that is totally irrelevant to the data itself. An example of a publisher statement is offered below, in which London and New York are followed by different punctuation because they precede different subfields:


[An example of a 260 field]
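One pragmatic approach to the punctuation problem is to strip trailing ISBD separators from each subfield before further processing. The sample 260 field below is illustrative, not a record from the actual dataset:

```python
import re

def strip_isbd(value):
    """Drop trailing ISBD separators ( : ; , / = . ) from a subfield value."""
    return re.sub(r"\s*[:;,/=.]\s*$", "", value)

# A made-up 260 field as (subfield code, value) pairs: places in $a,
# publisher in $b, date in $c, each carrying prescribed punctuation.
field_260 = [("a", "London ;"), ("a", "New York :"),
             ("b", "Routledge,"), ("c", "2019.")]
cleaned = [(code, strip_isbd(value)) for code, value in field_260]
print(cleaned)
```

This is lossy on purpose: the separators encode which subfield comes next, which the XML structure already tells us, so nothing semantic is lost.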

The second practical problem is that a single semantic unit in the MARC format may contain one to many data values. This structure makes it extremely difficult for a machine to understand the meaning of the data. A notable example is digits 24-27 of the 008 field (https://www.loc.gov/marc/bibliographic/bd008b.html). For book records, these digits represent the types of content that the described resource is or contains. This semantic unit has 28 values that catalogers may use, including bibliographies, catalogs, etc., and up to four values can be assigned to each record. The problem is that, even though a single value (such as "b") is quite meaningful, a combined value like "bcd" is much less so. In this case, this single data point has to be transformed into more than two dozen binary fields, each indicating whether a resource contains a given type of content, before the data can be meaningfully used in the next step.
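Exploding those combined codes into binary fields might look like the sketch below. Only four of the 28 content codes are listed here for brevity; the full list is in the MARC 21 documentation for 008/24-27:

```python
# A small subset of the 008/24-27 "nature of contents" codes for books.
CONTENT_CODES = {
    "b": "bibliographies",
    "c": "catalogs",
    "d": "dictionaries",
    "e": "encyclopedias",
}

def explode_008_contents(digits_24_27):
    """Turn a packed value like 'bcd ' into one boolean flag per content type."""
    present = set(digits_24_27.strip())
    return {name: (code in present) for code, name in CONTENT_CODES.items()}

print(explode_008_contents("bcd "))
```

Each record then contributes a row of booleans, which downstream analysis can use directly instead of parsing packed strings.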

While cleaning the MARC data can be quite challenging, it is still really fun for me to use my past skills to solve this problem and get new perspectives on what I did in the past.


Coyle, K. (2015). FRBR, before and after: a look at our bibliographic models. American Library Association.

Gorman, M. (2014). The Origins and Making of the ISBD: A Personal History, 1966–1978. Cataloging & Classification Quarterly, 52(8), 821–834. https://doi.org/10.1080/01639374.2014.929604


California Digital Library

California Digital Library – YAMZ (Week 2)
Bridget Disney
This week, I’ve been learning more about YAMZ. Going through the install process has been tedious but I have (barely) achieved a working instance. I was able to start the web server and display YAMZ on my localhost, and learned a bit in the process, so that was exciting!    
The difference is that I don't have any data in my PostgreSQL database. Here's where things get a little murky. To add a term, I have to log in to the system via Google. The login didn't seem to be working, so I changed some code to make it work on my local installation. However, it could be that the login was only intended for use with the Heroku (not local) system, so what I really need to do is somehow bypass the login when it runs on my computer. So it's back to the drawing board.
Even when I do login successfully, I am getting error messages – still working on those! These messages look like they might have something to do with one of the subsystems that YAMZ uses.    
After going through all that, Hanlin and I had a very useful Zoom session with John Kunze, our mentor, and the plans have been adjusted slightly. The directions for using YAMZ are different now because it's been a few years and the versions of the software have changed. Also, the free hosting server has limitations, so YAMZ needs to be moved from Heroku to Amazon's AWS. As such, Hanlin and I are revising the directions in a Google doc to document the new process.
John is working to get us direct access to the CDL server which requires us to VPN into our respective universities and then connect to the YAMZ servers. When that is all set up, we will work through the challenge of figuring out how to proceed to move code from development to production environments.
In the meantime, looking through the code I see there are also two Python components I need to get up to speed on – Flask (a micro framework for the user interface) and Django (a web framework for use with HTML).