Final post: Kai Li: Wrapping up the OCLC project

It’s been a really fast 10 weeks working with OCLC. While I missed quite a few blog posts, the work never stopped. This post only serves as a summary of the project. I will write a few more posts (potentially on my personal website) with more details about the project and the technical background behind the choices we made.

In this project, we applied network analysis and community detection methods to identify meaningful publisher clusters based on the ISBN publisher codes they use. From my perspective, this unsupervised learning approach was selected because no large-scale baseline exists for this problem, so a supervised approach using real-world data was not possible.
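To make the idea concrete, here is a minimal sketch in Python. It is not the method we actually used: it assumes the input is a list of (publisher name, ISBN publisher code) pairs, links names that share a code, and uses plain connected components as the simplest stand-in for a real community detection algorithm. The function name and the sample names and codes are made up for illustration.

```python
from collections import defaultdict


def cluster_publishers(records):
    """Group publisher names that share at least one ISBN publisher code.

    `records` is an iterable of (publisher_name, isbn_publisher_code) pairs.
    Connected components (via union-find) stand in for a real community
    detection algorithm; this is an illustrative simplification.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Names and codes are both nodes; each record adds one edge between them.
    for name, code in records:
        union(("name", name), ("code", code))

    clusters = defaultdict(set)
    for kind, value in list(parent):
        if kind == "name":
            clusters[find((kind, value))].add(value)
    return sorted(sorted(names) for names in clusters.values())


# Hypothetical data: names and codes are invented for this example.
sample_records = [
    ("Alpha Books", "111"),
    ("Alpha", "111"),
    ("Alpha", "222"),
    ("Beta Imprint", "222"),
    ("Gamma Press", "333"),
]
sample_clusters = cluster_publishers(sample_records)
```

Running the same function on the records of one year at a time would give yearly clusters in the spirit of what is described here.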

In the end, we got yearly publisher clusters that hopefully reflect the relationships between publishers in a given year. That being said, community detection methods are difficult to combine with temporal considerations. The year may not be a fully meaningful unit for analyzing how publishers are connected to each other (the relationship between any two publishers may well change in the middle of a given year), but we still hope this approach could generate more granular results than using data from all years together. The next step, which turned out to be much more substantial than expected, is to evaluate the results manually. And hopefully this project will be published in the near future.

Despite its limitations, I really learned a lot from this project. This is the first time I have had to work with library metadata on a really large scale. As this was almost my first project too large to be handled in R, I gained extensive experience using Python to process XML data. And during the process, I also read a lot about the publishing industry, whose relevance to our project proved to be more than significant.

The last point above is also one that I wish I had better appreciated at the beginning of this project. The most challenging part of this project is not any technical issue, but the complexity of the reality that we aim to understand through data analysis. Publishers and imprints can mean very different things in different social and data contexts. And there are different approaches to clustering them, each giving the clusters its own meaning. My lack of appreciation of the importance of the real publishing industry prevented me from foreseeing the difficulties of evaluating the results. I think in a way, this could mean that domain knowledge is fundamental to any algorithmic understanding of this topic (or other topics data scientists have to work on), and, to a lesser extent, that any automatic method is only secondary to the final solution to this question.


Week 2: Kai Li: What I talk about when I talk about publishers

As I mentioned in my previous posts, the entitization of publishers was only recently problematized when the new BIBFRAME model was proposed, which treats publishers as a separate entity in the overall bibliographic universe, rather than a text string in the MARC record. However, from the perspective of cataloging, we still do not seem to know much about what a publisher is.

In the MARC record, two fields, 260 and 264, are used for describing information about the publication, printing, distribution, issue, release, or production of the resource. The use of these two fields differs between the two cataloging rules, AACR2 (The Anglo-American Cataloging Rules, 2nd edition) and RDA (Resource Description and Access), which replaces AACR2. Under AACR2, all publisher and distributor information should be described in 260 subfield b, which can be repeated when there is more than one publisher or distributor. Under RDA, however, the 264 field should be used, and different functions (primarily publication and distribution in the previous context) are distinguished by the second indicator of this field. One of the issues with the AACR2 rules is that they do not require publisher names to be transcribed exactly as displayed in the resource: catalogers have the freedom to omit or abbreviate some name components, such as “publishers” and “limited.” In certain ways, the RDA rules are more consistent with how publishers are supposed to be dealt with in a more modern information infrastructure: publishers should be recorded in a more consistent and structured manner and not mixed with other types of entities (especially distributors, but also printers and issuers). But in practice, the majority of library bibliographic records were produced under the AACR2 rules, and these records are almost impossible to convert to RDA because we do not know which name components were omitted or abbreviated.
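As a rough illustration of this difference, the sketch below pulls candidate publisher names out of a MARCXML record with Python's standard library: every subfield b of a 260 field, and subfield b of a 264 field only when the second indicator is "1" (publication). The function name and the sample record are hypothetical, and a real record would normally carry either 260 or 264, not both.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def publisher_names(record):
    """Return the raw publisher strings found in one MARCXML <record>."""
    names = []
    for field in record.findall("marc:datafield", NS):
        tag = field.get("tag")
        is_aacr2_style = tag == "260"
        # In field 264, second indicator "1" marks a publication statement.
        is_rda_publication = tag == "264" and field.get("ind2") == "1"
        if is_aacr2_style or is_rda_publication:
            for sub in field.findall("marc:subfield[@code='b']", NS):
                if sub.text:
                    names.append(sub.text)
    return names


# Hypothetical record, invented for this example.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">London ;</subfield>
    <subfield code="a">New York :</subfield>
    <subfield code="b">Example Press,</subfield>
    <subfield code="c">1999.</subfield>
  </datafield>
  <datafield tag="264" ind1=" " ind2="1">
    <subfield code="b">Sample Books,</subfield>
  </datafield>
  <datafield tag="264" ind1=" " ind2="2">
    <subfield code="b">Some Distributor,</subfield>
  </datafield>
</record>"""

extracted_names = publisher_names(ET.fromstring(SAMPLE_RECORD))
```

Note that the strings come back exactly as transcribed, trailing punctuation included, and that the 264 field with second indicator "2" (distribution) is skipped.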

While the (inconsistent) way publisher names are described in the MARC format is one barrier to the identification of publishers, it is relatively easy to solve; the real challenge in the present project is the history of the publishing industry. In the real-world context, what is described in 260/264 subfield b is just an imprint, which, by definition, is the unit that publishes, no matter what that unit is (it could be a publisher, a brand or branch owned by the publisher, or an individual person who publishes the resource). For example, in this link, you can see all imprints owned by Penguin Random House, which, BTW, was formed in 2013 by the merger of Penguin Group and Random House, two of the largest publishers in the American publishing market.

Throughout the history of the publishing industry, publishers have been merging and splitting, just like the example of Penguin Random House. They might acquire another publisher in its entirety, or just some brands (imprints) owned by another publisher. And in some rare cases, an imprint was sold to a different publisher and later sold back to its original owner. Shown below is a slice of data manually collected by Cecilia about the history of Wiley, another major publisher in America.

[A slice of the history of Wiley]

From this imprint-centered view, a publisher is a higher-level entity than imprints, one that includes all its child entities at a given time. In other words, quite unlike other bibliographic concepts, such as works (“great works are timeless”), publishers or imprints exist in a temporal framework. But this is a huge challenge for this project, partly because the idea of temporality is extremely difficult to combine with network analysis methods. While I cannot offer a solution to this difficulty at this time, it will be an interesting topic to address further in my work.


Week 2: Kai Li: It’s all about MARC

It’s been two very busy weeks since my last update. It has almost become common sense that getting your hands dirty with data is the most important thing in any data science project. That is exactly what I have been doing in my project.

The scope of my project is one million records of books published in the US and UK since the mid-20th century. The dataset turns out to be even larger than I originally imagined. In XML format, the final data is a little below 6 gigabytes, which is almost the largest dataset that I have ever used. As someone who has (very unfortunately) developed quite solid skills in parsing XML data with R, the size of the file became the first major problem that I had to solve in this project: I could not load the whole XML file into R because it would exceed the string size limit that R allows (2 GB). But thanks to this limitation of R, I had the chance to re-learn XML parsing in Python. By re-using some code written by Vic last year, the new parser was developed without too much friction.
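For readers facing the same problem, here is a minimal sketch of streaming a large MARCXML file in Python so that only one record is held in memory at a time. It is not the parser that was actually built for this project; it assumes the file is a MARCXML collection, and the function name and the tiny in-memory sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"
RECORD_TAG = MARC_NS + "record"
CONTROL_TAG = MARC_NS + "controlfield"


def iter_records(source):
    """Yield <record> elements one at a time from a file path or file object.

    Each element must be processed before requesting the next one, because
    it is discarded afterwards to keep memory use flat.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == RECORD_TAG:
            yield elem
            root.clear()  # drop records that have already been handled


# Hypothetical two-record collection standing in for the multi-gigabyte file.
sample_file = io.BytesIO(
    b'<collection xmlns="http://www.loc.gov/MARC21/slim">'
    b'<record><controlfield tag="001">rec1</controlfield></record>'
    b'<record><controlfield tag="001">rec2</controlfield></record>'
    b"</collection>"
)
record_ids = [rec.find(CONTROL_TAG).text for rec in iter_records(sample_file)]
```

With a real file, `iter_records("path/to/file.xml")` would be used in the same way, with the path being whatever your data file is called.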

According to Karen Coyle (who, BTW, is one of my heroes in the world of library cataloging), the development of MARCXML represents how this (library cataloging) community missed the chance to fit its legacy data into the newer technological landscape (Coyle 2015, p. 53). She definitely has a point here: while MARCXML does an almost perfect job translating the countless MARC fields, subfields, and indicators into the structure of XML, it doesn’t do anything beyond that. It kept all the inconveniences of using the MARC format, especially the disconnection between text and semantics, which is the reason why we had the publisher entity problem in the first place.


[A part of one MARC record]

Some practical problems also emerged from these characteristics of MARCXML. The first one is that data hosted in the XML format keeps all the punctuation in the MARC records. This punctuation is required by the International Standard Bibliographic Description (ISBD), which was developed in the early 1970s (Gorman, 2014) and has been one of the most important cataloging principles in the MARC21 system. Punctuation in the bibliographic data mainly serves the needs of printed catalog users: it is said to give users more context about the information printed in the physical catalog (which no one is using today, if you noticed). Not surprisingly, this is a major source of problems for the machine-readability of library bibliographic data: different punctuation marks are supposed to be used when the same piece of data appears before different subfields within the same field, a context that is totally irrelevant to the data per se. One example of a publisher statement is offered below, in which London and New York are followed by different punctuation marks because they are followed by different subfields:


[An example of a 260 field]
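One simple way to deal with this, sketched here in Python, is to strip the trailing ISBD punctuation from each subfield value before using it. The set of characters and the function name are my own assumptions for illustration, and the rule is deliberately naive: a trailing period that belongs to an abbreviation (such as "Inc.") would be removed as well.

```python
import re

# Trailing ISBD-style punctuation (comma, semicolon, colon, slash, period),
# together with any whitespace around it. An assumed, incomplete set.
_TRAILING_ISBD = re.compile(r"\s*[,;:/.]+\s*$")


def strip_isbd(value):
    """Remove trailing ISBD punctuation from a single subfield value."""
    return _TRAILING_ISBD.sub("", value).strip()


# Hypothetical subfield values in the style of a 260 field.
raw_values = ["London ;", "New York :", "Example Press,", "1999."]
clean_values = [strip_isbd(v) for v in raw_values]
```

After this step, the two places of publication are left as plain place names, regardless of which subfield happened to follow them.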

The second practical problem is the fact that a single semantic unit in the MARC format may contain one to many data values. This data structure makes it extremely difficult for machines to understand the meaning of the data. A notable example is positions 24-27 of the 008 field (https://www.loc.gov/marc/bibliographic/bd008b.html). For book records, these positions represent what type of contents the described resource is or contains. This semantic unit has 28 values that catalogers may use, including bibliographies, catalogs, etc., and up to four values can be assigned to each record. The problem is that, even though a single value (such as “b”) can be very meaningful, it is much less so when values like “bcd” are used. In this case, this single data point in the MARC format has to be transformed into more than two dozen binary fields indicating whether a resource contains each type of content or not, so that the data can be meaningfully used in the next step.
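A minimal sketch of this transformation in Python is given below. The list of codes is my reading of the Library of Congress page linked above and should be checked against it; the function name, the column names, and the sample 008 string are made up for illustration.

```python
# Non-blank "nature of contents" codes for books (assumed from the LoC page).
CONTENT_CODES = "abcdefgijklmnopqrstuvwyz256"


def nature_of_contents(field_008):
    """Turn positions 24-27 of a book 008 field into one boolean per code."""
    used = set(field_008[24:28])  # positions are counted from 0
    return {f"contents_{code}": code in used for code in CONTENT_CODES}


# Hypothetical 40-character 008 field; only positions 24-27 ("bc  ") matter.
sample_008 = "0" * 24 + "bc  " + "0" * 12
content_flags = nature_of_contents(sample_008)
```

Here `content_flags["contents_b"]` and `content_flags["contents_c"]` are true and every other flag is false, so each content type can be used as its own variable in later analysis.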

While cleaning the MARC data can be quite challenging, it is still really fun for me to use my past skills to solve this problem and get new perspectives on what I did in the past.


Coyle, K. (2015). FRBR, before and after: a look at our bibliographic models. American Library Association.

Gorman, M. (2014). The Origins and Making of the ISBD: A Personal History, 1966–1978. Cataloging & Classification Quarterly, 52(8), 821–834. https://doi.org/10.1080/01639374.2014.929604


Week 1: Kai Li: How did I get here?

I would like to imagine that I’ve had quite a “weird” career path. After getting an undergraduate degree in history, I became a library cataloger in a public library in China. And then because of my love for librarianship, I came to the US to get a Master’s degree in Library and Information Science and then this PhD degree in Information Science. After starting the PhD, I gradually became aware of the dichotomy between being a professional librarian and being a researcher. I think a major difference is one’s epistemological stance: being a PhD means that you should be critical of all ideologies, including those embedded in your own business.

Long story short, all these seemingly not-so-related experiences converged in my LEADS project: “Automatic Identification of Publisher Entities to Support Discovery and Navigation,” one that is sponsored by OCLC to use data science methods to disambiguate publisher entities recorded in the publication statements in library bibliographic metadata.

Interestingly enough, this project is not a totally new idea for me either. When I was still working at Ingram Content Group in 2014 (also as a cataloger) and was about to start my PhD program, Mrs. Cecilia Preston talked to me about this idea. That was a time when VIAF.org and ISNI were still relatively new projects and “entitization” (or name disambiguation) was a major interest in the library cataloging communities. In general terms, this has been a problem for library cataloging for many years because publisher names are only transcribed into unstandardized text strings, thus preventing the library data from being used in other meaningful ways. This argument, of course, was made in Mr. Roy Tennant’s very famous article, “MARC Must Die.”

I am very glad to get some updated knowledge about this movement from Dr. Jean Godby, my supervisor in this summer project. The entitization of publishers is still a major task faced by library cataloging communities because in the BIBFRAME (Bibliographic Framework) model (one that is to replace the MARC format), the publisher is treated as an entity. To be an entity, all publishers must be freed from the text strings, disambiguated, and assigned their own identifiers.


So this is why I am here. I was super excited to read the project’s description when I decided to apply for the LEADS grant. And I am still super excited to spend the summer immersing myself in the library bibliographic data to figure out how to extract and disambiguate publishers in the most effective way. This, I hope, will play a small role in making the library data more useful to all its “users.”