Final post: Kai Li: Wrapping up of OCLC project

It’s been a really fast 10 weeks working with OCLC. While I missed quite a few blog posts, the work never stopped. This post will only serve as a summary of this project. I will write a few posts (potentially on my personal website) about more details of this project and some technical backgrounds behind the selections that we made.

In this project, we tried to apply network analysis and community detection methods to identify meaningful publisher clusters based on the ISBN publisher code they use. From my perspective, this unsupervised learning approach was selected because of a lack of baseline test conducted from a large-scale perspective, so that supervised approach using any real-world data is not possible.

In the end, we get yearly publisher clusters that hopefully reflects the relationship between publishers in a given year. That is being said, community detection methods is difficult to be combined with temporal considerations. The year may not be a fully meaningful unit to analyze how publishers are connected to each other (the relationship between any two publishers may well change in the middle of a given year), but we still hope this approach to publisher clusters could generate more granular results than using data in all years. The next step, though turned out to be much more substantial that what was expected, is to use manual approach to evaluate the results. And hopefully this project will be published in a near future.

Despite its limitations, I really learnt a lot from this project. This is the first time I have to play with library metadata in a really large scale. As almost my first project too large to be dealt with by R, I gained extensive experiences using Python to deal with XML data. And during the process, I also read a lot about the publishing industry, whose relationship with our project was proven to be more than significant.

The last point above is also one that I wish I better realized in the beginning of this project. The most challenging part of this project is not any technical issue, but the complexity of the reality that we aim to understand through data analysis. Publishers and imprints could mean very different things in different social and data contexts. And there are different approaches to clustering them with their own meanings underlying the clusters. My lack of appreciation of the importance of the real publishing industry prevented me from foreseeing the difficulties of evaluating the results. I think in a way, this could mean that field knowledge is fundamental to any algorithmic understanding of this topic (or other topics data scientists have to work on), and to a lesser extent, any automatic method is only secondary to the final solution to this question.  

Leave a Reply

Your email address will not be published. Required fields are marked *