LEADS Blog

Rongqian Ma; Week 1-2: Getting familiar with the project and exploring the initial dataset

My LEADS fellowship placement is with the University of Pennsylvania Libraries, Digital Research Services. The project this year aims to visualize a digitized collection of book of hours manuscripts produced in medieval Europe. The main idea behind the project is to better introduce and communicate this specific genre of book production to a wider audience, using visual forms and language.

During the Drexel University boot camp on June 6-8, I made the best use of the time by visiting the UPenn Library and meeting with my project mentor, Ms. Dot Porter. We discussed the project goals and the major tasks needed to deliver the project successfully. We identified two possible ways to present and share our major project outcomes: a research paper, and an interactive website displaying and communicating the visualizations.

I spent the first week of the LEADS project getting familiar with the “book of hours” as a genre and an artifact, reading secondary sources recommended by my mentor. Through those materials I developed a better understanding of the book of hours in terms of its history, major characteristics, and uniqueness in the religious life of the Middle Ages, which has helped me think of ways to visualize the manuscript data.

Week 2 was mostly spent browsing the dataset and proposing visualization strategies. The initial book of hours dataset contains information on 185 digitized manuscripts, including their dates of production, the provenance of their production and circulation, their contents (i.e., prayer passages), and their decorations. While thinking about visualization strategies, my mentor and I had a Skype check-in and discussed which types of visualizations and graphs to create and some potential problems in the visualization process. I also reflected on the ideas and theories from the information visualization session at the boot camp while trying to identify the most effective visualization strategies for the manuscript data. Following the discussion with my mentor, I started actually working with the initial dataset, the provenance data of manuscript production in particular. As the visualization work goes on, I find that each graph tends to be more complex than it appears, and manuscript data visualization is quite a craft.

LEADS Blog

Alyson Gamble: So it begins…

My name is Alyson Gamble and I’m a doctoral student from Simmons University. My placement in LEADS-4-NDP is with the Historical Society of Pennsylvania. Before the LEADS boot camp at Drexel, I was able to spend half a day at my host site. My mentor, Caroline Hayden, gave me a great tour of the HSP’s buildings and collections. I met other HSP employees who were part of the project last year. Being able to visit my host site in person helped acquaint me with both the people and the collections. I’ll be focusing on historical public school records from Philadelphia.
Figure 1. Picture of the historical marker for the Historical Society of Pennsylvania
The boot camp itself was very informative. Since I was not familiar with all of the concepts we discussed, I made sure to remember that this was an educational opportunity and to recognize that I don’t (and won’t) know everything. I enjoyed the presenters’ lessons, especially ones with a strong data visualization component. From my past experience and research, I’ve learned how important visualizations are for making data understandable. With the right visualization, a person can gain insight into data that they wouldn’t otherwise notice. Right now, I’m very fond of The Pudding (https://pudding.cool/) for data journalism; one of my students from my previous life as a science librarian, Caitlyn Ralph, is one of the site’s stars and I adore all of the work that she and others do on the publication. My favorite data visualization tool is currently Tableau, which I learned a lot about from Jess Cohen-Tanugi at Harvard’s Lamont Library. It’s pretty easy to use and makes nice dataviz. I’m especially fond of the idea of using Tableau as a sandbox for determining what kind of visuals will work best for a data set before creating those visuals in another program like R.
Figure 2. A picture of the final day of bootcamp
My favorite part of the bootcamp, though, was the opportunity to meet other doctoral students and the LEADS-4-NDP staff. Since I don’t get to interact with PhD students in person very often, it was a treat to spend time with the other LEADS Fellows. I’m very excited for our time together in this program and for seeing what happens with our projects.
Figure 3. Benches in the Drexel courtyard
LEADS Blog

Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems

Week 1 of the LEADS fellowship project starts with exploring the actual problems and directions we would like to work on over the course of 10 weeks.
To identify the problems, I reviewed some literature that will hopefully guide me toward the intersection of my research interests and the Academy of Natural Sciences’ (ANS) goals. These topics include, but are not limited to, taxonomies, knowledge graphs, biodiversity, geopolitics, and knowledge organization. We also want to link this project to the Biodiversity Heritage Library (BHL) and ANS collections.
Given the conversations I had with my mentor Steve Dilliplane at both the LEADS boot camp and the NASKO 2019 conference, I came up with the idea of a ‘multi-layer’ taxonomy alignment framework, which may ultimately guide us toward constructing a data-driven biodiversity knowledge graph/ontology of a specific species we wish to examine.
Multi-layers:
Species co-occurrence datasets often contain records based on the Darwin Core metadata standard. ‘Multi-layers’ in this case means the different fields in a co-occurrence dataset; these can include (again, not exclusively): species names, characters/phenotypes, habitat information, geolocations (country, city, latitude, longitude), IUCN Red List or other endangered species classifications, etc. Depending on what type of metadata a dataset actually uses, my thought is that each of these fields can itself be a taxonomy.
 
1st taxonomy: species names
2nd taxonomy: geographic regions (geopolitical realities may exist) 
3rd taxonomy: phenotypes
…and more
How do we proceed? BHL/ANS or co-occurrence datasets & TAPs:
Say we have two different datasets from BHL/ANS about grizzly bears.
Each of these fields can itself present a taxonomy alignment problem.
One dataset may locate the grizzly bear (species name identified by author X) only in the lower 48 states, and list the bears as endangered.
The other dataset may be an occurrence dataset of grizzly bears (species name identified by author Y) in Alaska, where the bears are abundant.
1st taxonomy alignment problem (TAP): align the species names given by Author X vs. Author Y
2nd TAP: geographic regions – lower 48 states vs. Alaska
3rd TAP: endangered species list – one classification vs. another classification
In this case we can align the multiple layers across datasets, and each layer will yield multiple possible worlds (merged solutions); a toy sketch follows below.
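To make this concrete, here is a minimal toy sketch in Python/pandas, not project code: the synonym mapping, region mapping, and status labels below are invented for illustration, and a real alignment would likely produce several competing mappings per layer rather than a single one.

import pandas as pd

# Dataset attributed to "author X": lower 48 states, bears listed as endangered.
ds_x = pd.DataFrame({
    "species_name": ["Ursus arctos horribilis"],
    "region": ["lower 48 states"],
    "status": ["endangered"],
})

# Dataset attributed to "author Y": Alaska occurrences, bears abundant.
ds_y = pd.DataFrame({
    "species_name": ["Ursus horribilis"],
    "region": ["Alaska"],
    "status": ["abundant"],
})

# Layer 1 (species names): one possible alignment treats the two names as synonyms.
name_alignment = {"Ursus horribilis": "Ursus arctos horribilis"}
ds_y["species_name"] = ds_y["species_name"].replace(name_alignment)

# Layer 2 (geographic regions): map both regions into one broader shared scheme.
region_alignment = {"lower 48 states": "contiguous United States", "Alaska": "Alaska"}
for ds in (ds_x, ds_y):
    ds["region"] = ds["region"].replace(region_alignment)

# One "possible world": the merged view after applying this set of per-layer alignments.
merged = pd.concat([ds_x, ds_y], ignore_index=True)
print(merged)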
Ontologies/knowledge graph/linked data:
If the abovementioned approach is feasible, my guess is that for each of the possible worlds we come up with, we can then patch them together to form our own ‘grizzly bear’ ontologies/knowledge graphs. This would enable us to visualize and query the data for future uses.
Questions: 
– Should we work with a particular species the Academy of Natural Sciences is most proud of?
– What does the actual dataset look like? 
– Are there any relations across different layers?
10-week rough timeline:
Week 1-2: identify the problems and research questions & come up with a 3-4 page proposal draft
Week 3-8: execute, implement the proposal 
Week 9-10: wrap up and draft deliverables 
 
———

 

Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica

 

LEADS Blog

Minh Pham, Week 1: Exploring the data

 

Week 1: Exploring the data

My placement is with the Repository Analytics & Metrics Portal (RAMP) project at Montana State University. Nikolaus, another LEADS fellow on the same project, provided a nice overview of the project. Thanks, Nikolaus!

 

Before the bootcamp, Nikolaus and I had an online meeting with our mentor, Dr. Kenning Arlitsch, and other members of the project team. They helped us understand more about the project and familiarized us with the data collected by the RAMP service. Thanks to the bootcamp, I came home filled with new knowledge about library science in general and metadata in particular, along with new techniques in database management, visualization, and analysis with text mining and machine learning methods.

 

For week 1, I focused on exploring the data by doing descriptive analysis and creating crude visualizations. The RAMP data covers over 50 IRs and consists of over 400 million rows. Due to the amount of data and the memory constraints of my laptop, it takes R anywhere from a couple of minutes to hours to run a command or knit a document. I looked into working with RStudio Cloud, but the current version of RStudio Cloud does not let us upload and work with data as big as RAMP’s. For now, I have to handle results generated from R the old-school way: copying and pasting them one by one into a Word document rather than making use of R Notebook or R Markdown to knit all the results into a single document.

 

My plan for the 2nd week is to refine the visualizations for aesthetics and readability, and to merge the RAMP data with other data to explore further research possibilities.

 

Minh Pham



LEADS Blog

Hanlin Zhang, LEADS Blog #1 Yamz Kickoff

 

Yamz Kickoff

June 23rd, 2019

 

This summer, I’m going to work with my mentor John Kunze from the California Digital Library (CDL) and another LEADS-4-NDP fellow, Bridget Disney (University of Missouri), to do some awesome metadata research! What Jane Greenberg, John Kunze, and other researchers in the area of metadata standards have found problematic is that when a metadata standard is being discussed and created, people (mostly domain experts) spend a relatively large amount of time discussing and setting the standards, controlled vocabularies, etc., but have little time left to test the actual performance of the standard and then revise it.

 

YAMZ (Yet Another Metadata Zoo) creates a unique experience similar to Wikipedia and Stack Overflow, in the sense that the community can co-edit and vote on terms in a standard. Our first kickoff meeting with the LEADS-4-NDP site supervisor, John, was on Friday. We learned that yamz.net is currently deployed on the free version of Heroku and is going to be migrated to Amazon Web Services (AWS) this summer, and Bridget and I are going to be part of that migration. I’m very excited that we are going to be involved in this process and expect to learn a lot of cool stuff.

 

To read more about Yamz:

http://www.yamz.net/about

 

The goals for next week:

  • Rewrite the README and improve its readability

  • Figure out how to remotely connect to CDL, preferably through the Drexel University network.

 

 
Hanlin Zhang
LEADS Blog

Jamillah Gabriel: Getting Acquainted with the Data

For my LEADS project, I’ll be working with the Digital Curation Innovation Center (DCIC) at the University of Maryland on a project that examines Japanese American internment camp archival records collected over a period of four years, from 1942 to 1946. I’m really excited to work on this project because of its cultural importance and the potential impact it could have on the Japanese American community, which, up to this point, has not had access to these records. The records consist of 25,000 cards that include details such as incidents in the camps, births and deaths, entries and exits, as well as transfers between camps.

 

After talking with my internship mentor, Richard Marciano, I decided to work on data that might help us track the movement of the internees within and among the camps from entry to exit in hopes that it might provide some insight into their lives. Additionally, examining data about the births and deaths in the camps could provide additional context that can aid in telling a more complete story of the Japanese American citizens who were subjected to imprisonment in internment camps. While the entire scope of the project has not been fleshed out completely, the preliminary steps of the research project will include parsing through three data files, looking at the previous projects conducted by MLIS students, reading the grant application which will allow the release of key data to the public, and viewing the “Resistance at Tule Lake” documentary. After these initial steps, I’ll begin to conceptualize what this data project will look like in terms of data processing and visualization.

 

I’m looking forward to what this project will bring to light in the remaining weeks of the internship!

 

Jamillah R. Gabriel

LEADS Blog

Nikolaus Parulian, Week 1: Exploratory Data Analysis – What can we do to understand the data?

LEADS site: Repository Analytics & Metrics Portal

 
 
After getting some ideas about data science, data analytics, and data visualization at the boot camp (Sonia already posted an excellent review of what we learned there), I started working on the Repository Analytics and Metrics Portal (RAMP) dataset provided by my mentors.
The Repository Analytics & Metrics Portal (RAMP) is a web service that improves the accuracy of institutional repository (IR) analytics.
RAMP provides a persistent and accurate count of file downloads from IRs, and it has great potential for IR metrics aggregation and comparison across the organizations that join the project.
 
The first thing I did with the dataset was to understand the data through exploratory data analysis. The RAMP dataset I am working on is derived from the Google Analytics Console and contains page clicks, URLs, average positions, and impressions, merged with additional data that RAMP provides. I visualized and aggregated most of the categorical columns in the dataset and computed the correlations between the numerical columns. Besides that, I also computed summary statistics to check whether there are outliers in the dataset.
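To give a sense of this kind of first pass, here is a rough sketch in Python/pandas, not the actual project code; the file name and column names are assumptions based on the fields mentioned above.

import pandas as pd

ramp = pd.read_csv("ramp_sample.csv")  # hypothetical extract of the RAMP data

# Aggregate a categorical column, e.g. counts of rows per repository.
print(ramp["repository_id"].value_counts().head(10))

# Correlations between the numerical columns.
numeric_cols = ["clicks", "impressions", "average_position"]
print(ramp[numeric_cols].corr())

# Summary statistics as a quick check for potential outliers.
print(ramp[numeric_cols].describe())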
 
In the end, I found some interesting results through the visualization and correlation analysis, and we will discuss the findings in our meeting in the second week.
 
Overall, the RAMP project is pretty exciting and has a lot of potential. I am excited to continue working on it.
 
 
Nikolaus Parulian
LEADS Blog

Julaine Clunis, Week 1: Getting Started

Hi everyone!

This is Julaine, and my assignment is with the Digital Public Library of America (DPLA). The DPLA has more than 3 million unique subject headings, with only a portion of those coming from controlled vocabularies, which can lead to various issues when records use slight term variations or synonyms for the same concept.
The aim of my project is to continue the development and testing of an effective method for analyzing record content and matching keywords with relevant controlled terms from a defined list, in an effort to create a consistent vocabulary that aids users, can be reliably re-ingested, and consistently supports analytics.
I have spent the last couple of days reading through a ton of documentation about the work that has already been completed on this project, familiarizing myself with the DPLA Metadata Application Profile, and getting set up with the software and data that have been recommended for use. I have been exploring Apache Spark for the first time and am slowly finding my way around it (downloading, installing, and setting up the environment on my machine and reviewing tutorials), so I haven’t really done much yet in terms of coming up with solutions to this problem, as I am just getting to know the tools and the data.
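To illustrate the kind of matching step described above, here is a minimal PySpark sketch, not project code; the file names, column names, and normalization rules are assumptions for illustration, and anything that fails an exact match after normalization would need fuzzier methods.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("subject-heading-matching").getOrCreate()

# Hypothetical inputs: DPLA subject headings and a defined list of controlled terms.
records = spark.read.csv("dpla_subjects.csv", header=True)        # column: subject
controlled = spark.read.csv("controlled_terms.csv", header=True)  # column: term

# Normalize both sides (lowercase, trim, collapse whitespace) before matching.
def normalize(col):
    return F.regexp_replace(F.trim(F.lower(col)), r"\s+", " ")

records = records.withColumn("subject_norm", normalize(F.col("subject")))
controlled = controlled.withColumn("term_norm", normalize(F.col("term")))

# A left join keeps unmatched headings so they can be counted and handled later.
matched = records.join(controlled, records.subject_norm == controlled.term_norm, "left")
matched.groupBy(F.col("term").isNotNull().alias("matched")).count().show()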
My mentors have been incredibly supportive and helpful and make themselves available to me in several ways. I expect I will learn a lot from working with them and am feeling really thankful for that. We use various tools such as Slack, Zoom and email to stay in touch so I am feeling positive about having access to direction or support if and when I need it.
Well, that is about all I have to report at this time.
I wish everyone the best of luck going forward with their projects.

Julaine Clunis

LEADS Blog

Week 1: Bridget Disney blog entry

LEADS: Getting Started
Bridget Disney, California Digital Library
My LEADS project is at the California Digital Library (CDL), working with mentor John Kunze and fellow participant Hanlin Zhang. From June 6-8, the LEADS fellows attended a three-day data science bootcamp in Philadelphia. It was a great opportunity to meet the LEADS staff and the other students. What an amazing group! I’m sure that we will learn from each other and collaborate on projects in the future. We learned a lot from the professors, who introduced us to the basic concepts (in some depth) of data science. It was helpful to have a complete overview of everything from metadata to text processing to visualization.

 

LEADS-4-NDP Data Science Boot Camp
At the CDL, I’ll be working on YAMZ (http://yamz.net), which stands for Yet Another Metadata Zoo. The tagline on the website describes it as “A crowdsourced metadata dictionary. Search for terms, upvote useful ones.” This platform is used by those developing and sharing controlled vocabularies. The software is written in Python using a PostgreSQL database.
I spent the first week hopelessly trying to feel my way around and set up the environment for YAMZ. I have never used Python and am excited to get the chance to learn it. It looks like there are two choices of operating system for this project – Mac and Ubuntu, a Unix-like operating system that can run on a desktop. I elected to give the Mac a try. I started using a Macintosh two years ago, just to see how it worked, and now I love it so much there’s no turning back! However, while installing the components, I have run into a few obstacles. Hopefully, I’ll be able to work through them.
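One small sanity check I can run while setting things up, since YAMZ is Python on top of PostgreSQL, is a quick connection test from Python. This is just an illustrative sketch; the database name, user, and password below are placeholders rather than the project’s actual configuration.

import psycopg2

conn = psycopg2.connect(
    dbname="yamz_dev",      # placeholder database name
    user="postgres",        # placeholder user
    password="changeme",    # placeholder password
    host="localhost",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()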
Perusing the documentation, I see there is an article about the scoring of metadictionary terms (Patton, 2014, Community-based scoring of metadictionary terms) that might be helpful. Also, Hanlin sent me a link to get me started with GitHub (https://help.github.com/en/articles/connecting-to-github-with-ssh). So now I have some reading to do!
LEADS Blog

Week 1: Sonia Pascua, I am a LEADS-4-NDP 2019 Fellow

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Headings
I am privileged to be one of the LEADS-4-NDP fellows under this year’s grant. My placement is with the Digital Scholarship Center at Temple University, and my mentor is Peter Logan. Currently, we are at the project proposal stage and establishing a proof of concept. We’re also looking at a paper as one of our outputs, which we plan to submit to a conference such as NKOS or Dublin Core.
As a fellow, I attended the recent 3-day data science boot camp held at our university, Drexel University. As I posted on LinkedIn, I was really excited to learn and to meet my co-fellows at this boot camp. The days went by quickly in this great endeavor, but I came away with a good account of my experience.
Day 1 was fully packed with lectures and getting to know my co-fellows and our respective projects. Our icebreaker was fantastic: it gave us the opportunity to get to know participants in a fun way by asking a partner a couple of questions and then presenting to everyone in the room what we had found. It revealed exciting facts about the co-fellows and broke the rigidity among us. From that moment on I felt comfortable with everyone.
The lectures on Introduction to Data Science by Prof. Erjia and Big Data Management by Prof. Il-Yeong, both from CCI, were inspiring, especially when they shared their own understanding of the concepts. I liked how Prof. Erjia started with “A hundred people will have a hundred definitions of data science (DS)…”, which explained why the field is treated so differently in different settings. I also liked how he drilled into the multidisciplinary skills needed by a modern data scientist and coached us to pick one skill and be good at it; it would be hard to work on all four skill sets (mathematics and statistics, programming and databases, domain knowledge and soft skills, communications and visualizations) and be a jack of all trades across them. That may leave you a master of none, which is not fruitful for a career. As an academic researcher, it’s advisable to be strong in one skill and be a good part of a team in a DS endeavor. I also appreciated Prof. Erjia’s list of biases, which, if understood, could be key to overcoming challenges encountered in DS.
On another note, Prof. Il-Yeong presented a rich compendium of what has happened over time in the database field. His story of “Old SQL to NoSQL to New SQL” was awesome and provided an understanding of what we have now. It was also great to have what I have been teaching validated, and to hear about databases from an “antiqua” person. Don’t get me wrong: for me, the term “antiqua” is full of respect and admiration. In my 10 years of teaching databases, only a handful of people strike me as knowing the heart and soul of the database, and Il-Yeong is one of them.
The data science talk by one of the mentors, Dr. Jean Godby, a senior research scientist at OCLC, was precious. She laid out a good perspective for understanding the challenges and promises of data science.
The day ended with a group dinner at Han Dynasty. We were joined by the department head of CCI at Drexel University, Dr. Xia; Dr. Michelle Rogers; and Dr. Peter Logan, one of the mentors of the LEADS-4-NDP project and the director of the Digital Scholarship Center, which is my placement.
Days 2 and 3, I should say, were further stretches of lectures together with workshops in R. We got our hands dirty with coding and building our technical skills in the basics of R. Topics ranged from data pre-processing, data visualization and visual analytics, and data mining and machine learning II to text processing, plus a mini-workshop on BigML, a code-free tool for automated data analytics. Dr. Richard Marciano gave a short data science talk and presented the projects he and the Digital Curation Innovation Center (DCIC) are working on. Additionally, Dr. Jane Greenberg delivered her presentation on metadata, data quality, and metadata integration.
I will miss the fellows. We did not get much time to really get to know each other, but at heart they are colleagues and a cohort I can work with on this research journey of my life. I wish us all success in our projects. I’m looking forward to our virtual meetings, since we’re all working through the summer from different states. How I wish we had time for bonding and trips.