I like to think that I’ve had quite a “weird” career path. After getting an undergraduate degree in history, I became a library cataloger in a public library in China. Then, because of my love for librarianship, I came to the US to get a Master’s degree in Library and Information Science, followed by this PhD in Information Science. Since starting the PhD, I have gradually come to see a dichotomy between being a professional librarian and being a researcher. I think a major difference is one’s epistemological stance: being a PhD means that you should be critical of all ideologies, including those embedded in your own profession.
Long story short, all these seemingly unrelated experiences converged in my LEADS project, “Automatic Identification of Publisher Entities to Support Discovery and Navigation.” The project is sponsored by OCLC and uses data science methods to disambiguate the publisher entities recorded in the publication statements of library bibliographic metadata.
Interestingly enough, this project is not a totally new idea for me either. In 2014, when I was still working at Ingram Content Group (also as a cataloger) and was about to start my PhD program, Mrs. Cecilia Preston talked to me about this idea. At that time VIAF.org and ISNI were still relatively new projects, and “entitization” (or name disambiguation) was a major interest in the library cataloging community. In general terms, this has been a problem for library cataloging for many years: publisher names are only transcribed as unstandardized text strings, which prevents the library data from being used in other meaningful ways. This argument, of course, was made in Mr. Roy Tennant’s well-known article, “MARC Must Die.”
I am very glad to get an update on this movement from Dr. Jean Godby, my supervisor for this summer project. The entitization of publishers is still a major task facing the library cataloging community because in the BIBFRAME (Bibliographic Framework) model, which is intended to replace the MARC format, the publisher is treated as an entity. To become an entity, each publisher must be freed from its text strings, disambiguated, and assigned its own identifier.
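To make the idea of “freeing publishers from text strings” a bit more concrete, here is a toy sketch in Python. It is not the method used in the project; it only illustrates the general shape of the task under simple assumptions. The sample strings, the `normalize` and `assign_ids` helpers, and the `pub-0001` style identifiers are all hypothetical and made up for illustration.

```python
import re

# Hypothetical list of generic words to ignore when comparing names.
GENERIC_WORDS = {"the", "inc", "co", "ltd", "and"}


def normalize(name: str) -> str:
    """Build a crude comparison key: lowercase, strip punctuation, drop generic words."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in GENERIC_WORDS]
    return " ".join(tokens)


def assign_ids(publisher_strings):
    """Map each transcribed string to a made-up local identifier.

    Strings that share a normalized key receive the same identifier.
    """
    key_to_id = {}
    result = {}
    for raw in publisher_strings:
        key = normalize(raw)
        if key not in key_to_id:
            key_to_id[key] = f"pub-{len(key_to_id) + 1:04d}"
        result[raw] = key_to_id[key]
    return result


# Hypothetical transcribed publisher strings, as they might appear in records.
samples = [
    "Penguin Books,",
    "Penguin Books Ltd.",
    "PENGUIN BOOKS",
    "Oxford University Press",
    "Oxford Univ. Press",
]

for raw, pub_id in assign_ids(samples).items():
    print(f"{pub_id}  {raw}")
```

In this sketch the three Penguin variants collapse into one identifier, but “Oxford University Press” and “Oxford Univ. Press” do not, because simple string cleanup cannot resolve abbreviations. That gap is one reason real disambiguation calls for more than normalization.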
So this is why I am here. I was super excited to read the project’s description when I decided to apply for the LEADS grant. And I am still super excited to spend the summer immersing myself in library bibliographic data to figure out how to extract and disambiguate publishers in the most effective way. This, I hope, will play a small role in making library data more useful to all its “users.”