LEADS Blog

Week 2: Kai Li: It’s all about MARC

It’s been two very busy weeks since my last update. It has almost become common sense that getting your hands dirty with data is the most important thing in any data science project. That is exactly what I have been doing in my project.

The scope of my project is one million records of books published in the US and UK since the mid-20th century. The dataset turns out to be even larger than I originally imagined. In XML format, the final data is a little under 6 gigabytes, which is almost the largest dataset that I have ever used. As someone who has (very unfortunately) developed quite solid skills in parsing XML data with R, the size of the file became the first major problem that I had to solve in this project: I could not load the whole XML file into R because it would exceed the limit on string size that R allows (2 GB). But thanks to this limitation of R, I had the chance to re-learn XML parsing in Python. By re-using some code written by Vic last year, I developed the new parser without too much friction.
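
For readers who want to try something similar, here is a minimal sketch of a streaming approach in Python. It is not the parser used in this project (which built on Vic's code). The helper names `iter_records` and `subfield_values` and the two sample records are invented for illustration. The idea is to let `xml.etree.ElementTree.iterparse` hand over one `<record>` at a time, so the whole file never has to sit in memory as a single string.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") documents.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def iter_records(source):
    """Yield one <record> element at a time from a MARCXML file or stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == MARC_NS + "record":
            yield elem
            # Runs once the caller asks for the next record:
            # drop the children of the record that was just handled.
            elem.clear()


def subfield_values(record, tag, code):
    """Return the text of every subfield `code` in every datafield `tag`."""
    values = []
    for field in record.iter(MARC_NS + "datafield"):
        if field.get("tag") != tag:
            continue
        for sub in field.iter(MARC_NS + "subfield"):
            if sub.get("code") == code:
                values.append(sub.text or "")
    return values


# Invented sample data, standing in for the real multi-gigabyte file.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">London ;</subfield>
      <subfield code="a">New York :</subfield>
      <subfield code="b">Routledge,</subfield>
      <subfield code="c">1998.</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="b">Example Press,</subfield>
    </datafield>
  </record>
</collection>"""

publishers = [
    subfield_values(rec, "260", "b") for rec in iter_records(io.BytesIO(SAMPLE))
]
```

On a real file, a path can be passed to `iter_records` instead of the in-memory sample. Cleared records stay attached to the root as empty elements, so memory use still creeps up slowly on very large files.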

According to Karen Coyle (who, BTW, is one of my heroes in the world of library cataloging), the development of MARCXML represents how this (library cataloging) community missed the chance to fit its legacy data into the newer technological landscape (Coyle 2015, p. 53). She definitely has a point here: while MARCXML does an almost perfect job translating the countless MARC fields, subfields, and indicators into the structure of XML, it doesn’t do anything beyond that. It kept all the inconveniences of using the MARC format, especially the disconnection between text and semantics, which is the reason why we had the publisher entity problem in the first place.

blog2_pic1.jpg

[A part of one MARC record]

Some practical problems also emerged from these characteristics of MARCXML. The first is that data hosted in the XML format keeps all the punctuation in the MARC records. The use of punctuation is required by the International Standard Bibliographic Description (ISBD), which was developed in the early 1970s (Gorman, 2014) and has been one of the most important cataloging principles in the MARC21 system. Punctuation in the bibliographic data mainly serves the needs of printed catalog users: it is said to help users get more context about the information printed in the physical catalog (which no one is using today, if you noticed). Not surprisingly, this is a major source of problems for the machine-readability of library bibliographic data: different punctuation marks are supposed to be used when the same piece of data appears before different subfields within the same field, a context that is totally irrelevant to the data per se. One example of a publisher statement is offered below, in which London and New York are followed by different punctuation marks because they are followed by different subfields:

graph2_pic2.jpg

[An example of a 260 field]
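
One possible way to deal with this kind of punctuation is sketched below. It is only an illustration, not the cleaning code used in this project. The function name `strip_isbd` and the sample values are assumptions based on the London and New York example described above. It removes a single trailing comma, colon, semicolon, or slash, and deliberately leaves trailing periods alone, since a period may belong to an abbreviation such as "Inc."

```python
import re

# One trailing ISBD separator (, : ; /) with optional whitespace around it.
_TRAILING_ISBD = re.compile(r"\s*[,:;/]\s*$")


def strip_isbd(value):
    """Remove one trailing ISBD separator from a subfield value."""
    return _TRAILING_ISBD.sub("", value).strip()


# Hypothetical subfield $a values from a 260 field.
raw_places = ["London ;", "New York :"]
clean_places = [strip_isbd(v) for v in raw_places]
```

After this step both place names compare equal no matter which subfield happened to follow them in the record.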

The second practical problem is that a single semantic unit in the MARC format may contain one to many data values. This data structure makes it extremely difficult for machines to understand the meaning of the data. A notable example is character positions 24-27 of the 008 field (https://www.loc.gov/marc/bibliographic/bd008b.html). For book records, these positions represent the types of content that the described resource is or contains. This semantic unit has 28 values that catalogers may use, including bibliographies, catalogs, etc., and up to four values can be assigned to each record. The problem is that, even though a single value (such as “b”) can be very meaningful, it is much less so when values like “bcd” are used. In this case, this single data point in the MARC format has to be transformed into more than two dozen binary fields indicating whether a resource contains each type of content or not, so that the data can be meaningfully used in the next step.
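
The transformation into binary fields might look roughly like the sketch below. It is a simplified illustration rather than the project's actual code. The function name `contents_flags` and the sample 008 string are made up, and only four of the content codes are listed (the full list is in the Library of Congress documentation linked above).

```python
# Illustrative subset of the 008/24-27 "nature of contents" codes for books.
CONTENT_CODES = {
    "b": "bibliographies",
    "c": "catalogs",
    "d": "dictionaries",
    "e": "encyclopedias",
}


def contents_flags(field_008):
    """Turn positions 24-27 of an 008 string into one boolean per code."""
    # Positions are zero-based and 27 is inclusive, hence the slice 24:28.
    codes = set(field_008[24:28])
    return {label: code in codes for code, label in CONTENT_CODES.items()}


# Invented 40-character 008 value with "bcd" in positions 24-26.
sample_008 = " " * 24 + "bcd " + " " * 12
flags = contents_flags(sample_008)
```

Blank or fill characters in those positions simply match no code, so a record with no content types produces a row of all `False` values.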

While cleaning the MARC data can be quite challenging, it is still really fun for me to use my past skills to solve this problem and get new perspectives on what I did in the past.

REFERENCES

Coyle, K. (2015). FRBR, before and after: a look at our bibliographic models. American Library Association.

Gorman, M. (2014). The Origins and Making of the ISBD: A Personal History, 1966–1978. Cataloging & Classification Quarterly, 52(8), 821–834. https://doi.org/10.1080/01639374.2014.929604


California Digital Library

California Digital Library – YAMZ (Week 2)
Bridget Disney
This week, I’ve been learning more about YAMZ. Going through the install process has been tedious but I have (barely) achieved a working instance. I was able to start the web server and display YAMZ on my localhost, and learned a bit in the process, so that was exciting!    
The difference is because I don’t have any data in my PostgreSQL database. Here’s where things get a little bit murky. To add a term, I have to log in to the system via Google. The login didn’t seem to be working, so I changed some code to make it work on my local installation. However, it could be that the login was only intended for use with the Heroku (not local) system, so what I really need to do is somehow bypass the login when it runs on my computer. So it’s back to the drawing board.
Even when I do log in successfully, I am getting error messages – still working on those! These messages look like they might have something to do with one of the subsystems that YAMZ uses.
After going through all that, Hanlin and I had a very useful Zoom session with John Kunze, our mentor, and the plans have been adjusted slightly. The directions for using YAMZ are different now because it’s been a few years and the versions of the software used have changed. Also, the free hosting server has limitations, and YAMZ needs to be moved from Heroku to Amazon’s AWS. As such, Hanlin and I are revising the directions in a Google Doc to document the new process.
John is working to get us direct access to the CDL server which requires us to VPN into our respective universities and then connect to the YAMZ servers. When that is all set up, we will work through the challenge of figuring out how to proceed to move code from development to production environments.
In the meantime, looking through the code I see there are also two Python components I need to get up to speed on – Flask (a micro framework for the user interface) and Django (a web framework for use with HTML).

Week 02 – Historical Society of Pennsylvania

This week’s work could be defined by data gathering and meeting having. I handled a lot of logistics, such as creating a communication plan with Caroline Hayden, my mentor at the Historical Society of Pennsylvania (HSP). I was also able to discuss the project with last year’s Fellow, Karen Boyd, who gave me a great overview from her perspective. I’d previously viewed Karen’s lightning talk about her work at HSP, but being able to discuss what she did and what she thinks is a good next phase helped me figure out the scope for my own work on the project. Along with the coordination with Caroline and Karen, the LEADS Fellows had an online meeting where we discussed what we’ve been doing since leaving our boot camp in Philadelphia. I enjoyed hearing how other people’s work is progressing and am excited to begin the next stage of my own. 

Alyson Gamble
Doctoral Student, Simmons University

Week 2: Understanding the limitation of data – What we can’t do

LEADS site: Repository Analytics & Metrics Portal

 

 

After developing some visualizations to understand the relationships between columns in the RAMP dataset, we had a follow-up meeting to discuss the results.
The visualizations I discussed in the meeting focus on aggregations over categorical values in the RAMP dataset, including the number of visits for each index and each domain name (URL), the number of visitors for citable and non-citable content, and the number of visits by user device, along with histograms for position, clicks, and clickThrough.
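
As a rough illustration of this kind of aggregation, here is a small Python sketch that sums clicks per category. It is not the code behind the visualizations. The column names (`device`, `citable`, `clicks`), the helper `total_by`, and the sample rows are all assumptions made for the example, not the actual RAMP schema.

```python
import csv
import io
from collections import defaultdict


def total_by(rows, key, value="clicks"):
    """Sum the numeric `value` column for each distinct value of `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


# Invented sample rows with hypothetical column names.
SAMPLE = """device,citable,clicks
DESKTOP,True,3
MOBILE,True,1
DESKTOP,False,2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
clicks_by_device = total_by(rows, "device")
clicks_by_citable = total_by(rows, "citable")
```

The resulting dictionaries can be passed to any plotting tool to draw the bar charts for each category.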
In the meeting, we also discussed the possibility of incorporating external data such as metadata for each index. One of the mentors, Jonathan, has been trying to merge metadata into the older RAMP dataset period (2018), and we can also extract the metadata from the new dataset that we want to focus on analyzing.
What I will do next for this dataset is extract metadata to make the data richer, so that we can understand more about the behavior of the users through the metadata, and form a research question that we want to focus on for the RAMP dataset.
Nikolaus Parulian

 


Jamillah Gabriel: From Relocation to Internment to Detention (and Everything in Between)

In the past couple of weeks, a flurry of articles have been published about concentration camps and their place in American society and history. My mentor shared them with me and I have found them useful in contextualizing my work with the Japanese American internment cards. I’m reminded of how my LEADS project and the data I’m working with are still relevant today, when concentration camps can’t be relegated to the past and, in fact, are very much a reincarnated racist reality in the present. Three of the four articles sent to me (listed below) connect the history of Japanese American internment camps with current issues around the migrant detention camps that have been implemented to detain migrant children crossing the border from Mexico, and highlight the fact that this, unfortunately, is history repeating itself. For instance, Ft. Sill, which is now a migrant detention center, was founded in 1869 and was once “a relocation camp for Native Americans, a boarding school for Native children separated from their families, and an internment camp for 700 Japanese American men in 1942” (Hennessy-Fiske, 2019). Its unmitigated and irreconcilable history is a continued legacy of racial difference, segregation, and discrimination. All of the articles reinforce the importance of this project that I (and two other LEADS fellows before me) am working on, but the last piece written by the granddaughter of a survivor of the Japanese American incarcerations is truly the most motivating factor for this work: so that former internees and their family members can know their own histories.

 

 

References:

Friedman, M. (2019, June 19). American concentration camps: A history lesson for Liz Cheney. The Typescript. Retrieved from http://thetypescript.com/american-concentration-camps-a-history-lesson-for-liz-cheney

Hennessy-Fiske, M. (2019, June 22). Japanese internment camp survivors protest Ft. Sill migrant detention center. Los Angeles Times. Retrieved from https://www.latimes.com/nation/la-na-japanese-internment-fort-sill-2019-story.html

 Provost, L. (2019, June 22). Prepared for arrest: Japanese-Americans protest at Fort Sill over incoming migrant children. The Duncan Banner. Retrieved from https://www.duncanbanner.com/news/prepared-for-arrest-japanese-americans-protest-at-fort-still-over/article_789070aa-9542-11e9-8107-9fcd6387dce9.html

 Sakurai, C. (2019, June 25). More than a name in the census: Piecing together the story of my grandmother’s life. National Japanese American Historical Society. Retrieved from https://www.facebook.com/notes/national-japanese-american-historical-society/more-than-a-name-in-the-census-piecing-together-the-story-of-my-grandmothers-lif/2679119588783598

 

Jamillah R. Gabriel, PhD Student, MLIS, MA
School of Information Sciences
University of Illinois at Urbana-Champaign
jrg3@illinois.edu

 


Week 2: Elaborating on our multi-level alignment idea and an initial exploration on the BHL collection

This week I explored the multi-level alignment idea further, and I am almost convinced that we can turn this idea into a ‘dataset merging’ problem.
The dataset merging idea is not new. For example, one paper from my PhD advisor briefly discusses how to merge taxonomic data: Towards Best-effort merge on taxonomically organized data
But our group at UIUC (in collaboration with systematics experts from ASU) has mainly been working on the actual alignment of taxonomic names rather than on ‘dataset merging’.
For the dataset merging idea, our proposal is pretty simple.
If we can align taxonomic names, we should also be able to align other things in the dataset such as spatial information (in our case, countries/areas).
Naturally, finding the intersection between my project site, the Academy of Natural Sciences, and my interest in taxonomy became the priority for this week. The task I set for myself was to find a species that is endemic to or popular across Taiwan (my geographical point of interest), and that also happens to appear somewhere in the text of either the proceedings or the journals of the Academy of Natural Sciences.
The quest went on with me fascinated (and slightly sidetracked) by the orchid populations and the many varieties that Taiwan has. To my surprise, one news article (in Chinese) mentioned that Taiwan has more than 0.9 billion moth orchids!
(image source:britannica.com)
Then I went on to create our dataset merging idea first around the orchids:
Basically, the idea is that if we have two occurrence datasets on orchids, then we can do the dataset merging with the two datasets like the figure shown above, with each column being one ‘taxonomy alignment problem’.
Just as I was almost set on going for the beautiful orchid flowers, I finally turned back to BHL to search the keyword “Taiwan” and set the Titles on “Academy of Natural Sciences”. This is when I found a whole new world of Mollusca (snails)!
The entry returned for the intersection of “Taiwan” and “ANS” is from the Proceedings of the Academy of Natural Sciences, v.57, 1905, and the title of the page/chapter is: “Catalogue of the Land and Fresh-water Mollusca of Taiwan (Formosa) with descriptions of new species”.
As the BHL search interface above shows, the scientific names on this page were also extracted and are shown in the bottom left corner. With this breakthrough on the Mollusca (possibly endemic to Taiwan), I will begin to work with these species on the dataset merging idea next week!
 
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica

 


Rongqian Ma; Week 1-2: Getting familiar with the project and exploring the initial dataset

My LEADS fellowship placement is with the University of Pennsylvania Libraries, Digital Research Services. The project this year aims to visualize a digitized collection of book of hours manuscripts produced in medieval Europe. The major idea behind the project is to better introduce and communicate this specific genre of book production to the audience, using visual forms and languages.

During the Drexel University boot camp on June 6-8, I made the most of my time by visiting the UPenn Library and meeting with my project mentor, Ms. Dot Porter. We discussed the project goals and the major tasks needed to successfully deliver the project. We identified two possible ways to present and share our major project outcomes, one as a research paper and the other as an interactive website displaying and communicating the visualizations.

I spent the first week of the LEADS project getting familiar with the “book of hours” as a genre and an artifact, reading secondary sources recommended by my mentor. By reading those materials I developed a better understanding of the book of hours in terms of its history, major characteristics, and uniqueness in the religious life of the Middle Ages, which has helped me think of ways to visualize the manuscript data. Week 2 was mostly spent browsing the dataset and proposing visualization strategies. The initial book of hours dataset contains information on 185 digitized manuscripts, including their dates of production, the provenance of production and circulation, the contents (i.e., passages of prayer), and the decorations. To think through the visualization strategies, my mentor and I had a Skype check-in and discussed which types of visualizations and graphs to create and some potential problems involved in the visualization process. I also reflected on the ideas and theories communicated in the information visualization session at the boot camp when trying to identify the most effective visualization strategies for the manuscript data. Following the discussion with my mentor, I started actually working with the initial dataset, the provenance data of manuscript productions in particular. As the visualization work goes on, I feel that each graph tends to be more complex than it appears, and that visualizing manuscript data is quite a craft.


Alyson Gamble: So it begins…

My name is Alyson Gamble and I’m a doctoral student from Simmons University. My placement in LEADS-4-NDP is with the Historical Society of Pennsylvania. Before the LEADS boot camp at Drexel, I was able to spend half a day at my host site. My mentor, Caroline Hayden, gave me a great tour of the HSP’s buildings and collections. I met other HSP employees who were part of the project last year. Being able to visit my host site in person was helpful to acquaint me with both the people and the collections. I’ll be focusing on historical public school records from Philadelphia.
IMG_20190605_1141190.jpg
Figure 1. Picture of the historical marker for the Historical Society of Pennsylvania
The boot camp itself was very informative. Since I was not familiar with all of the concepts we discussed, I made sure to remember that this was an educational opportunity and to recognize that I don’t (and won’t) know everything. I enjoyed the presenters’ lessons, especially ones with a strong data visualization component. From my past experience and research, I’ve learned how important visualizations are for making data understandable. With the right visualization, a person can gain insight into data that they wouldn’t otherwise notice. Right now, I’m very fond of The Pudding (https://pudding.cool/) for data journalism; one of my students from my previous life as a science librarian, Caitlyn Ralph, is one of the site’s stars and I adore all of the work that she and others do on the publication. My favorite data visualization tool is currently Tableau, which I learned a lot about from Jess Cohen-Tanugi at Harvard’s Lamont Library. It’s pretty easy to use and makes nice dataviz. I’m especially fond of the idea of using Tableau as a sandbox for determining what kind of visuals will work best for a data set before creating those visuals in another program like R.
IMG_20190608_122019_01.jpg
Figure 2. A picture of the final day of bootcamp
My favorite part of the bootcamp, though, was the opportunity to meet other doctoral students and the LEADS-4-NDP staff. Since I don’t get to interact with PhD students in person very often, it was a treat to spend time with the other LEADS Fellows. I’m very excited for our time together in this program and for seeing what happens with our projects.
IMG_20190606_0808010.jpg
Figure 3. Benches in the Drexel courtyard

Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems

Week 1 of the LEADS fellowship project starts with exploring the actual problems and directions we would like to work on over the course of 10 weeks.
To identify the problems, I reviewed some literature that may hopefully guide me towards the intersection of my research interests and the goals of the Academy of Natural Sciences (ANS). These topics include, but are not limited to, taxonomies, knowledge graphs, biodiversity, geo-politics, and knowledge organization. We also want to link this project to the Biodiversity Heritage Library (BHL) and ANS collections.
Given the conversations I had with my mentor Steve Dilliplane at both the LEADS boot camp and the NASKO 2019 conference, I came up with this interesting idea of a ‘multi-layer’ taxonomy alignment problem/framework, which may ultimately guide us towards constructing a data-driven biodiversity knowledge graph/ontology of a specific species we wish to examine.
Multi-layers:
A lot of the time, species co-occurrence datasets contain records based on the Darwin Core metadata standard. Multi-layers in this case means the different fields in a co-occurrence dataset. These can include (again, not limited to): species names, characters/phenotypes, habitat information, geolocations (country, cities, latitude, longitude), IUCN Red List/other endangered species classifications, etc. Depending on what type of metadata is actually used in the dataset, my thought is that each of these fields can itself be a taxonomy.
 
1st taxonomy: species names
2nd taxonomy: geographic regions (geopolitical realities may exist) 
3rd taxonomy: phenotypes
…and more
How do we proceed? BHL/ANS or co-occurrences datasets & TAP:
Say we have two different datasets from BHL/ANS about Grizzly Bears. 
Each of these fields can itself have a taxonomy alignment problem.
One dataset may only locate the Grizzly bear (species name identified by author X) in the lower 48 states, and list the bears as endangered.
The other dataset may be the occurrence dataset of the Grizzly bears (species name identified by author Y) in Alaska, where the bears are more than abundant.
1st taxonomy alignment problem (TAP): align the species names given by Author X vs. Author Y
2nd TAP: geographic regions – lower 48 states vs. Alaska
3rd TAP: endangered species list – one classification vs. another classification
In this case we can align the multi-layers in different datasets, and each layer will come up with multiple possible worlds (merged solutions).
Ontologies/knowledge graph/linked data:
If the abovementioned approach is feasible, my guess is that for each of the possible worlds we come up with, we can then patch them together to form our own ‘grizzly bear’ ontologies/knowledge graphs. This would enable us to visualize and query them in the future.
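
To make the idea of combining layers a bit more concrete, here is a toy Python sketch. It is only an illustration under strong assumptions: the candidate alignments per layer are invented strings, the function name `possible_worlds` is hypothetical, and a real workflow would derive the alignments with a reasoning tool rather than list them by hand. The sketch simply enumerates every combination of one alignment per layer, which is one simple reading of how per-layer possible worlds could be patched together.

```python
from itertools import product

# Hypothetical candidate alignments for each layer of the two datasets.
candidate_alignments = {
    "species name": [
        "X.grizzly equals Y.grizzly",
        "X.grizzly includes Y.grizzly",
    ],
    "region": ["lower 48 states disjoint from Alaska"],
    "status": [
        "endangered overlaps abundant",
        "endangered disjoint from abundant",
    ],
}


def possible_worlds(layers):
    """List every combination that picks one alignment from each layer."""
    names = list(layers)
    combos = product(*(layers[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


worlds = possible_worlds(candidate_alignments)
```

With two candidates for species names, one for regions, and two for status, this yields four combined worlds. The count multiplies quickly as layers are added, which is one reason to narrow down the layers early.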
Questions: 
– Work with a particular species the Academy of Natural Sciences is most proud of?
– What does the actual dataset look like? 
– Are there any relations across different layers?
10-week rough timeline:
Week 1-2: identify the problems and research questions & come up with a 3-4 page proposal draft
Week 3-8: execute, implement the proposal 
Week 9-10: wrap up and draft deliverables 
 
———

 

Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica

 


Minh Pham, Week 1- Exploring the data

 


My placement is with the Repository Analytics & Metrics Portal (RAMP) project at Montana State University. Nikolaus, another LEADS fellow on the same project, provided a nice overview of the project. Thanks, Nikolaus!

 

Before the bootcamp, Nikolaus and I had an online meeting with our mentor, Dr. Kenning Arlitsch, and other members of the project. Dr. Arlitsch and the other members helped us understand more about the project and familiarized us with the data collected from the RAMP service. Thanks to the bootcamp, I came home filled with new knowledge about library science in general and metadata in particular, as well as new techniques in database management, visualization, and analysis with text mining and machine learning methods.

 

For week 1, I focused on exploring the data by doing descriptive analysis and creating crude visualizations from the data. RAMP data consists of numbers from over 50 institutional repositories (IRs) and contains over 400 million rows. Due to the amount of data and the memory constraints of my laptop, it takes R anywhere from a couple of minutes to hours to run a command or knit the document. I looked into the option of working with RStudio Cloud, but the current version does not allow us to upload and work with data as big as RAMP. For now, I have to handle the results generated from R the old-school way: copying and pasting them one by one into a Word doc rather than knitting all the results into a single document with an R notebook or R Markdown.

 

My plan for the 2nd week is to refine the visualization for aesthetics and readability and merge RAMP data with other data to explore research possibilities from the RAMP data.

 

Minh Pham