Week 2: Kai Li: It’s all about MARC

It’s been two very busy weeks since my last update. It has almost become a common sense that getting your hands dirty with data is the most important thing in any data science project. That is exactly what I have been doing in my project.

The scope of my project is one million records of books that are published in the US and UK since the mid-20th century. The dataset turns out to be even larger than I originally imagined. In the format of XML, the size of the final data is a little below 6 gigabytes, which is almost the largest dataset that I have ever used. As someone who has (very unfortunately) developed quite solid skills to parse XML data using R, the size of the file became the first major problem that I had to solve in this project: I could not load the whole XML file into R because it would exceed the limit of the string size that R allows (2 GB). But thanks to this limitation of R, I had the chance to re-learn about XML parsing in the environment of Python. By re-using some codes written by Vic last year, the new parser was developed without too much friction.

According to Karen Coyle (whom BTW, is one of my heroes in the world of library cataloging), the development of MARCXML represents how this (library cataloging) community missed the chance to fit its legacy data into the newer technological landscape (Coyle 2015, p. 53). She definitely got a point here: while MARCXML does an almost perfect job translating the countless MARC fields, subfield, and indicators into the structure of XML, it doesn’t do anything beyond that. It kept all the inconveniences of using MARC format, especially the disconnection between text and semantics, which is the reason why we had the publisher entity problem in the first place.


[A part of one MARC record]

Some practical problems also emerged from this characteristics of MARCXML. The first one is that data hosted in the XML format keeps all punctuations in the MARC records. The use of punctuations is required by the International Standard Bibliographic Description (ISBD), which was developed in the early 1970s (Gorman, 2014) and has been one of the most important cataloging principles in the MARC21 system. Punctuations in the bibliographic data mainly serve the needs of printed catalog users: they are said to help users to get more contexts about the information printed in the physical catalog (which no one is using today, if you noticed). Not surprisingly, this is a major source of problem for the machine-readability of library bibliographic data: different punctuations are supposed to be used when the same piece of data are used before different subfields within the same field, a context that is totally irrelevant to the data per se. One example about publisher statement is offered below, in which London and New York are followed by different punctuations because they are followed by different subfields:


[An example of a 260 field]

The second practical problem is the fact that a single semantic unit in the MARC format may contain one to many data values. This data structure makes it extremely difficult for machine to understand the meaning of the data. A notable example is the 24-27 digits in the 008 field ([https://www.loc.gov/marc/bibliographic/bd008b.html]). For book records, these digits represent what type of contents that the described resource is or contains. This semantic unit has 28 values that catalogers may use, including bibliographies, catalogs, et al. and for each record, up to four values can be assigned to the record. The problem is that, even though using a single value (such as “b”) can be very meaningful, it is much less so when values like “bcd” are used. In this case, this single data point in the MARC format has to be transformed into more than two dozen binary fields indicating whether a resource contains each type of content or not, so that the data can be meaningfully used for the next step.

While cleaning the MARC data can be quite challenging, it is still really fun for me to use my past skills to solve this problem and get new perspectives on what I did in the past.


Coyle, K. (2015). FRBR, before and after: a look at our bibliographic models. American Library Association.

Gorman, M. (2014). The Origins and Making of the ISBD: A Personal History, 1966–1978. Cataloging & Classification Quarterly, 52(8), 821–834. https://doi.org/10.1080/01639374.2014.929604

3 thoughts on “Week 2: Kai Li: It’s all about MARC”

  1. Kai, What an interesting post! It seems to outline the challenges of library data that has been defined by pre-computer traditions. I’m looking forward to see how you work with some of these issues.


  2. Kai, I was reading this and a recent punctuation problem had come to mind. I had some text and wanted to extract the questions in the text. While it seemed like it would be easy at first, especially after getting the “R” tips for strings at bootcamp, it turned out to be harder than I imagined because of the other punctuation that can exist in sentences. I did find some examples in Python that seemed better suited that “R” for this task. I’m not as familiar with Python so I ended up pulling out the questions manually – fortunately my data was not as extensive as yours!

  3. Hi Bridget, sorry that I just saw your second message. I would say if your goal is just to remove all ending punctuations, it would be relatively easier using regex. But for me, the difficult part is to have accurate match between all subfields a and b. But please let me know if there is anything that I can help. (I will try to remember to send you an email later about this.)

Leave a Reply

Your email address will not be published. Required fields are marked *