LEADS Blog

Hanlin Zhang, LEADS Blog #1 Yamz Kickoff

 

Yamz Kickoff

June 23rd, 2019

 

This summer, I’m going to work with my mentor John Kunze from the California Digital Library (CDL) and another LEADS-4-NDP fellow, Bridget Disney (University of Missouri), to do some awesome metadata research! What Jane Greenberg, John Kunze, and other researchers in the area of metadata standards have found problematic is that when a metadata standard is being discussed and created, people (mostly domain experts) spend a relatively large amount of time discussing and setting the standards, controlled vocabularies, and so on, but have little or no time to test the actual performance of such a standard and then revise it.

 

YAMZ (Yet Another Metadata Zoo) creates a unique experience that is similar to Wikipedia and Stack Overflow in the sense that the community can co-edit and vote on a standard. Our first kickoff meeting with the LEADS-4-NDP site supervisor, John, was on Friday. We learned that yamz.net is currently deployed on the free version of Heroku and is going to be moved to Amazon Web Services (AWS) this summer, and Bridget and I are going to be part of that. I’m very excited that we are going to be involved in this process, and I expect to learn a lot of cool stuff.

 

To read more about Yamz:

http://www.yamz.net/about

 

The goals for next week:

  • Rewrite the README and improve its readability

  • Figure out how to remotely connect to CDL, preferably through a Drexel University Network.

 

 
Hanlin Zhang
LEADS Blog

Jamillah Gabriel: Getting Acquainted with the Data

For my LEADS project, I’ll be working with the Digital Curation Innovation Center (DCIC) at the University of Maryland on a project that examines Japanese American internment camp archival records collected over a period of four years, from 1942 to 1946. I’m really excited to work on this project because of its cultural importance and the potential impact it could have on the Japanese American community, which, up to this point, has not had access to these records. The records consist of 25,000 cards that include details such as incidents in the camp, births and deaths, entries and exits, as well as transfers between camps.

 

After talking with my internship mentor, Richard Marciano, I decided to work on data that might help us track the movement of the internees within and among the camps from entry to exit in hopes that it might provide some insight into their lives. Additionally, examining data about the births and deaths in the camps could provide additional context that can aid in telling a more complete story of the Japanese American citizens who were subjected to imprisonment in internment camps. While the entire scope of the project has not been fleshed out completely, the preliminary steps of the research project will include parsing through three data files, looking at the previous projects conducted by MLIS students, reading the grant application which will allow the release of key data to the public, and viewing the “Resistance at Tule Lake” documentary. After these initial steps, I’ll begin to conceptualize what this data project will look like in terms of data processing and visualization.

 

I’m looking forward to what this project will bring to light in the remaining weeks of the internship!

 

Jamillah R. Gabriel

LEADS Blog

Nikolaus Parulian, Week 1: Exploratory Data Analysis – What can we do to understand the data?

LEADS site: Repository Analytics & Metrics Portal

 
 
After getting some ideas about data science, data analytics, and data visualization at the boot camp (Sonia already posted an excellent review of what we learned there), I started working on the Repository Analytics and Metrics Portal (RAMP) dataset provided by my mentors.
RAMP is a web service that improves the accuracy of institutional repository (IR) analytics.
It provides a persistent and accurate count of file downloads from IRs and offers a lot of potential for aggregating and comparing IR metrics across the organizations that join the project.
 
The first thing I did with the dataset was to understand the data through exploratory data analysis. The RAMP dataset I am working on is derived from the Google Search Console, which contains page_clicks, URL, average_positions, and impressions, merged with additional data that RAMP provides. I visualized and aggregated most of the categorical columns in the dataset and computed the correlations between the numerical columns. I also computed summary statistics to see whether there are outliers in the dataset.
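The three steps described above (aggregating a categorical column, correlating numerical columns, and checking for outliers) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the rows, the `device` field, and the values are invented and are not the real RAMP schema or data, and the 1.5 × IQR rule is one common outlier heuristic, not necessarily the one used in the project.

```python
from collections import Counter
from math import sqrt
from statistics import mean, quantiles

# Hypothetical sample: (device, clicks, impressions). All values are made up.
sample = [
    ("desktop", 1, 10), ("mobile", 2, 20), ("desktop", 3, 30),
    ("mobile", 4, 40), ("desktop", 5, 50), ("mobile", 6, 60),
    ("desktop", 7, 70), ("tablet", 100, 1000),
]
rows = [{"device": d, "clicks": c, "impressions": i} for d, c, i in sample]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither column is constant


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# 1. Aggregate a categorical column.
device_counts = Counter(row["device"] for row in rows)

# 2. Correlate two numerical columns.
clicks = [row["clicks"] for row in rows]
impressions = [row["impressions"] for row in rows]
correlation = pearson(clicks, impressions)

# 3. Flag candidate outliers in a numerical column.
outliers = iqr_outliers(clicks)
```

On a real export, a dataframe library would probably be more convenient, but the logic of each step stays the same.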
 
In the end, I found some interesting results through the visualization and correlation analysis, and we will discuss the findings in our meeting in the second week.
 
Overall, this RAMP project is pretty exciting and has a lot of potential. I am excited to keep working on it.
 
 
Nikolaus Parulian
LEADS Blog

Julaine Clunis, Week 1: Getting Started

Hi everyone!

This is Julaine, and my assignment is with the Digital Public Library of America (DPLA). The DPLA has more than 3 million unique subject headings, only a portion of which come from controlled vocabularies. This can cause problems when records use slight term variations or synonyms for the same concept.
The aim of my project is to continue developing and testing an effective method for analyzing record content and matching that content, including keywords, with relevant controlled terms from a defined list. The goal is a consistent vocabulary that aids users, can be reliably re-ingested, and consistently supports analytics.
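As a rough illustration of what matching term variations to a defined list can look like, here is a minimal sketch using Python's standard `difflib` module. It is not the project's method: the vocabulary, the `normalize` and `match_term` helpers, and the 0.8 cutoff are all assumptions made for the example, and string similarity alone will not catch true synonyms.

```python
import difflib
import re

# Hypothetical controlled vocabulary; a real list would be far larger.
CONTROLLED_TERMS = ["Photographs", "World War, 1939-1945", "Agriculture"]


def normalize(term):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    term = re.sub(r"[^\w\s]", " ", term.lower())
    return re.sub(r"\s+", " ", term).strip()


def match_term(raw, vocabulary=CONTROLLED_TERMS, cutoff=0.8):
    """Return the controlled term closest to `raw`, or None if nothing is close."""
    index = {normalize(term): term for term in vocabulary}
    key = normalize(raw)
    if key in index:  # exact match after normalization
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None
```

For example, `match_term("photographs.")` resolves to `"Photographs"` through normalization alone, while a misspelling such as `"Agricultur"` is caught by the fuzzy step. Terms with no close match return `None` and would need manual review or a synonym table.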
I have spent the last couple of days reading through a ton of documentation about the work that has already been completed on this project, familiarizing myself with the DPLA Metadata Application Profile, and getting set up with the software and data that have been recommended for use. I have been exploring Apache Spark for the first time, and I am slowly finding my way around it (downloading, installing, and setting up the environment on my machine, and reviewing tutorials), so I haven’t really done much in terms of coming up with solutions to this problem yet, as I am just getting to know the tools and the data.
My mentors have been incredibly supportive and helpful and make themselves available to me in several ways. I expect I will learn a lot from working with them and am feeling really thankful for that. We use various tools such as Slack, Zoom and email to stay in touch so I am feeling positive about having access to direction or support if and when I need it.
Well, that is about all I have to report at this time.
I wish everyone the best of luck going forward with their projects.

Julaine Clunis

News & Events

CCI Presents at ICHI 2019 in Xi’an, China

Drexel CCI participated in the 7th IEEE International Conference on Healthcare Informatics (ICHI 2019) in Xi’an, China, from June 10-13th. CCI professor Chris Yang served as the general co-chair and panelist for the conference.

PhD students Ou Stella Liang and Michal Monselise presented their full paper, “Identifying Important Risk Factors Associated with Vehicle Injuries using Driving Behavior Data and Predictive Analytics.” The paper was co-authored with Chris Yang. Ou Stella also presented a data analytics challenges paper co-authored with Ali Jazayeri and Chris Yang, entitled “Interpatient Similarity-based Imputation of Missing Data in Electronic Health Records.”

Ou Stella participated in the doctoral consortium with her presentation, “Determining Safe Prescription Practices for Pregnant Women.”

 

LEADS Blog

Week 1: Bridget Disney blog entry

LEADS: Getting Started
Bridget Disney, California Digital Library
My LEADS project is at the California Digital Library (CDL), working with mentor John Kunze and fellow participant Hanlin Zhang. On June 8th, the LEADS fellows attended a three-day data science bootcamp in Philadelphia. It was a great opportunity to meet the LEADS staff and the other students. What an amazing group! I’m sure that we will learn from each other and collaborate on projects in the future. We learned a lot from the professors who introduced us to the basic concepts (in some depth) of data science. It was helpful to have a complete overview of everything from metadata to text processing to visualization.

 

LEADS-4-NDP Data Science Boot Camp
At the CDL, I’ll be working on YAMZ (http://yamz.net), which stands for Yet Another Metadata Zoo. The web site bills itself as “A crowdsourced metadata dictionary. Search for terms, upvote useful ones.” This platform is used by those developing and sharing controlled vocabularies. The software is written in Python and uses a PostgreSQL database.
I spent the first week hopelessly trying to feel my way around and setting up the environment for YAMZ. I have never used Python and am excited to get the chance to learn it. It looks like there are two choices of operating system for this project: Mac and Ubuntu, a Unix-like operating system that can run on a desktop. I elected to give the Mac a try. I started using a Macintosh two years ago, just to see how it worked, and now I love it so much that there’s no turning back! However, while installing the components, I have run into a few obstacles. Hopefully, I’ll be able to work through those.
Perusing the documentation, I see there is an article about scoring metadictionary terms (Patton, 2014, Community-based scoring of metadictionary terms) that might be helpful. Also, Hanlin sent me a link to get me started with GitHub (https://help.github.com/en/articles/connecting-to-github-with-ssh). So now I have some reading to do!
News & Events

MRC Hosts NASKO 2019

The Metadata Research Center hosted the North American Symposium on Knowledge Organization (NASKO 2019) from June 13-14.

NASKO 2019 Participants: MRC’s Sam Grabus, Jane Greenberg, Sonia Pascua, and Deborah Garwood.

MRC PhD student Sam Grabus presented her paper, “Representing Aboutness: Automatically Indexing 19th-Century Encyclopedia Britannica Entries.” The presentation discussed topic relevance revaluation for automatic indexing results, evaluating which of three keyword extraction algorithms produces the most relevant results for the digital collection.

Sam Grabus presenting at NASKO 2019
LEADS Blog

Week 1: Kai Li: How did I get here?

I would like to imagine that I’ve had quite a “weird” career path. After getting an undergraduate degree in history, I became a library cataloger in a public library in China. Then, because of my love for librarianship, I came to the US to get a Master’s degree in Library and Information Science, followed by this PhD in Information Science. After starting my PhD, I gradually came to see a dichotomy between being a professional librarian and being a researcher. I think a major difference is one’s epistemological stance: doing a PhD means that you should be critical of all ideologies, including those embedded in your own business.

Long story short, all these seemingly not-so-related experiences converged in my LEADS-4-NDP project, “Automatic Identification of Publisher Entities to Support Discovery and Navigation,” which is sponsored by OCLC and uses data science methods to disambiguate publisher entities recorded in the publication statements of library bibliographic metadata.

Interestingly enough, this project is not a totally new idea for me either. When I was still working at Ingram Content Group in 2014 (also as a cataloger) and was about to start my PhD program, Mrs. Cecilia Preston talked to me about this idea. That was a time when VIAF.org and ISNI were still relatively new projects and “entitization” (or name disambiguation) was a major interest in the library cataloging communities. In general terms, this has been a problem for library cataloging for many years because publisher names are only transcribed into unstandardized text strings, thus preventing the library data from being used in other meaningful ways. This argument, of course, was made in Mr. Roy Tennant’s very famous article, “MARC Must Die.”

I am very glad to get some updated knowledge about this movement from Dr. Jean Godby, my supervisor in this summer project. The entitization of publishers is still a major task faced by library cataloging communities because in the BIBFRAME (Bibliographic Framework) model (one that is to replace the MARC format), the publisher is treated as an entity. To be an entity, all publishers must be freed from the text strings, disambiguated, and assigned their own identifiers.
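To make the idea of freeing publishers from text strings concrete, here is a toy sketch of grouping transcribed publisher names under shared identifiers. Everything in it is an assumption made for illustration: the noise-word list, the `publisher_key` and `assign_identifiers` helpers, and the `pub-0001` identifier format are invented, and real disambiguation would need much more than string normalization (imprints, mergers, places of publication, and so on).

```python
import re

# Hypothetical noise words that often vary between transcriptions.
NOISE_WORDS = {"the", "and", "inc", "co", "ltd", "company"}


def publisher_key(name):
    """Reduce a transcribed publisher name to a crude comparison key."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^\w\s]", " ", name)  # replace punctuation with spaces
    tokens = [tok for tok in name.split() if tok not in NOISE_WORDS]
    return " ".join(tokens)


def assign_identifiers(names):
    """Map each name to an identifier shared by all names with the same key."""
    ids_by_key = {}
    result = {}
    for name in names:
        key = publisher_key(name)
        if key not in ids_by_key:
            # Identifiers are numbered in order of first appearance.
            ids_by_key[key] = f"pub-{len(ids_by_key) + 1:04d}"
        result[name] = ids_by_key[key]
    return result


names = ["Harper & Row", "Harper and Row, Inc.", "HARPER & ROW.", "Knopf"]
entities = assign_identifiers(names)
```

In this sketch the three Harper & Row variants collapse to one identifier while Knopf gets its own, which is the basic shift from a string to an entity.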


 

So this is why I am here. I was super excited to read the project’s description when I decided to apply for the LEADS grant. And I am still super excited to spend the summer immersing myself in library bibliographic data to figure out how to extract and disambiguate publishers in the most effective way. This, I hope, will play a small role in making library data more useful to all its “users.”

News & Events

MRC Co-Sponsors NASKO this week: NASKO Highlights

The Metadata Research Center is co-hosting the North American Symposium on Knowledge Organization (NASKO 2019) from June 13-14th, at the College of Computing and Informatics.

Howard White: “On Patrick Wilson”

Professor Emeritus and Visiting Research Professor Howard White will deliver a special presentation at NASKO, titled “On Patrick Wilson.”


News & Events

Metadata Mixer: “Metadata Madness”

TOPIC: Metadata Madness – accomplishments for the year, and/or goals for the summer.
Presenters: CCI PhD students, Cecilia Preston
Date: Wednesday, June 12th
Time: 12:30-1:30 PM
Location: 3675 Market Street, University City Science Center (CCI’s new location)
Room: Dean’s conference room, #1039 (10th floor)

ADDED FUN: A visit to the Metadata Research Center, now residing on the 11th floor of 3675 Market Street, joining AI (artificial intelligence) and data science [This is for guests outside CCI who may attend].